crawl4ai
This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install basher83-lunar-claude-crawl4ai
Repository
Skill path: .claude/skills/crawl4ai
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: basher83.
This is a mirrored public skill entry; review the repository before installing it into production workflows.
What it helps with
- Install crawl4ai into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/basher83/lunar-claude before adding crawl4ai to shared team environments
- Use crawl4ai for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: crawl4ai
description: This skill should be used when users need to scrape websites, extract structured data, handle JavaScript-heavy pages, crawl multiple URLs, or build automated web data pipelines. Includes optimized extraction patterns with schema generation for efficient, LLM-free extraction.
---
# Crawl4AI
## Overview
This skill provides comprehensive support for web crawling and data extraction using the Crawl4AI library, including the complete SDK reference, ready-to-use scripts for common patterns, and optimized workflows for efficient data extraction.
## Quick Start
### Installation Check
```bash
# Verify installation
crawl4ai-doctor
# If issues, run setup
crawl4ai-setup
```
### Basic First Crawl
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:500])  # First 500 chars

asyncio.run(main())
```
### Using Provided Scripts
```bash
# Simple markdown extraction
python scripts/basic_crawler.py https://example.com
# Batch processing
python scripts/batch_crawler.py urls.txt
# Data extraction
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
```
## Core Crawling Fundamentals
### 1. Basic Crawling
Understanding the core components for any crawl:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Browser configuration (controls browser behavior)
browser_config = BrowserConfig(
    headless=True,             # Run without GUI
    viewport_width=1920,
    viewport_height=1080,
    user_agent="custom-agent"  # Optional custom user agent
)

# Crawler configuration (controls crawl behavior)
crawler_config = CrawlerRunConfig(
    page_timeout=30000,           # 30 seconds timeout
    screenshot=True,              # Take screenshot
    remove_overlay_elements=True  # Remove popups/overlays
)

# Execute crawl with arun()
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=crawler_config
    )
    # CrawlResult contains everything
    print(f"Success: {result.success}")
    print(f"HTML length: {len(result.html)}")
    print(f"Markdown length: {len(result.markdown)}")
    print(f"Links found: {len(result.links)}")
```
### 2. Configuration Deep Dive
**BrowserConfig** - Controls the browser instance:
- `headless`: Run with/without GUI
- `viewport_width/height`: Browser dimensions
- `user_agent`: Custom user agent string
- `cookies`: Pre-set cookies
- `headers`: Custom HTTP headers
**CrawlerRunConfig** - Controls each crawl:
- `page_timeout`: Maximum page load/JS execution time (ms)
- `wait_for`: CSS selector or JS condition to wait for (optional)
- `cache_mode`: Control caching behavior
- `js_code`: Execute custom JavaScript
- `screenshot`: Capture page screenshot
- `session_id`: Persist session across crawls
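The two configs compose naturally: one `BrowserConfig` per crawler instance, one `CrawlerRunConfig` per crawl. A minimal sketch combining a few of the options above (URL, user agent, and selector values are placeholders):

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    user_agent="my-crawler/1.0",          # placeholder user agent
    headers={"Accept-Language": "en-US"}  # custom HTTP headers
)
run_config = CrawlerRunConfig(
    page_timeout=30000,            # 30 s page-load/JS budget
    wait_for="css:.article-body",  # hypothetical selector to wait for
    screenshot=True,
    session_id="demo_session"      # persist this page across crawls
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com", config=run_config)
```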
### 3. Content Processing
Basic content operations available in every crawl:
```python
result = await crawler.arun(url)
# Access extracted content
markdown = result.markdown # Clean markdown
html = result.html # Raw HTML
text = result.cleaned_html # Cleaned HTML
# Media and links
images = result.media["images"]
videos = result.media["videos"]
internal_links = result.links["internal"]
external_links = result.links["external"]
# Metadata
title = result.metadata["title"]
description = result.metadata["description"]
```
## Markdown Generation (Primary Use Case)
### 1. Basic Markdown Extraction
Crawl4AI excels at generating clean, well-formatted markdown:
```python
# Simple markdown extraction
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")
    # High-quality markdown ready for LLMs
    with open("documentation.md", "w") as f:
        f.write(result.markdown)
```
### 2. Fit Markdown (Content Filtering)
Use content filters to get only relevant content:
```python
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# Option 1: Pruning filter (removes low-quality content)
pruning_filter = PruningContentFilter(threshold=0.4, threshold_type="fixed")
# Option 2: BM25 filter (relevance-based filtering)
bm25_filter = BM25ContentFilter(user_query="machine learning tutorials", bm25_threshold=1.0)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
result = await crawler.arun(url, config=config)
# Access filtered content
print(result.markdown.fit_markdown) # Filtered markdown
print(result.markdown.raw_markdown) # Original markdown
```
### 3. Markdown Customization
Control markdown generation with options:
```python
config = CrawlerRunConfig(
    # Exclude elements from markdown
    excluded_tags=["nav", "footer", "aside"],
    # Focus on a specific CSS selector
    css_selector=".main-content",
    # Clean up formatting
    remove_forms=True,
    remove_overlay_elements=True,
    # Control link handling
    exclude_external_links=True,
    exclude_internal_links=False
)

# Custom markdown generation
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

generator = DefaultMarkdownGenerator(
    options={
        "ignore_links": False,
        "ignore_images": False,
        "image_alt_text": True
    }
)
```
## Data Extraction
### 1. Schema-Based Extraction (Most Efficient)
For repetitive patterns, generate schema once and reuse:
```bash
# Step 1: Generate schema with LLM (one-time)
python scripts/extraction_pipeline.py --generate-schema https://shop.com "extract products"
# Step 2: Use schema for fast extraction (no LLM)
python scripts/extraction_pipeline.py --use-schema https://shop.com generated_schema.json
```
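The same two-step flow can also be done in Python with the `generate_schema` helper shown in the bundled SDK reference; a minimal sketch (provider and token values are illustrative):

```python
import json
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Step 1 (one-time LLM cost): infer a schema from a sample of the page's HTML
sample_html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"
schema = JsonCssExtractionStrategy.generate_schema(
    sample_html,
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="your-token")
)
with open("generated_schema.json", "w") as f:
    json.dump(schema, f, indent=2)

# Step 2 (every subsequent crawl, no LLM): reuse the saved schema
with open("generated_schema.json") as f:
    strategy = JsonCssExtractionStrategy(json.load(f))
```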
### 2. Manual CSS/JSON Extraction
When you know the structure:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "date", "selector": ".date", "type": "text"},
        {"name": "content", "selector": ".content", "type": "text"}
    ]
}

extraction_strategy = JsonCssExtractionStrategy(schema=schema)
config = CrawlerRunConfig(extraction_strategy=extraction_strategy)
```
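Wiring the strategy into a crawl then yields JSON in `result.extracted_content` (the blog URL below is a placeholder):

```python
import json
from crawl4ai import AsyncWebCrawler

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://blog.example.com", config=config)
    if result.success and result.extracted_content:
        articles = json.loads(result.extracted_content)
        for article in articles:
            print(article["title"], article["date"])
```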
### 3. LLM-Based Extraction
For complex or irregular content:
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",
    instruction="Extract key financial metrics and quarterly trends"
)
```
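The bundled SDK reference wires the strategy through an `LLMConfig` and a run config; a sketch of that pattern (the URL and instruction are illustrative):

```python
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY")),
    instruction="Extract key financial metrics and quarterly trends"
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        "https://example.com/quarterly-report",  # placeholder URL
        config=CrawlerRunConfig(extraction_strategy=strategy)
    )
    print(result.extracted_content)  # JSON string produced by the LLM
```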
## Advanced Patterns
### 1. Deep Crawling
Discover and crawl links from a page:
```python
# Basic link discovery
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url)

    # Extract and process discovered links
    internal_links = result.links.get("internal", [])
    external_links = result.links.get("external", [])

    # Crawl discovered internal links
    for link in internal_links:
        if "/blog/" in link and "/tag/" not in link:  # Filter links
            sub_result = await crawler.arun(link)
            # Process sub-page

# For advanced deep crawling, consider using URL seeding patterns
# or custom crawl strategies (see complete-sdk-reference.md)
```
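For query-driven deep crawls, the bundled SDK reference also documents an `AdaptiveCrawler` that follows only relevant links and stops once it has gathered enough information; a minimal sketch of that API:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def crawl_docs():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)
        # Crawl outward from the start URL until the query is well covered
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")

asyncio.run(crawl_docs())
```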
### 2. Batch & Multi-URL Processing
Efficiently crawl multiple URLs:
```python
urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

async with AsyncWebCrawler() as crawler:
    # Concurrent crawling with arun_many()
    results = await crawler.arun_many(
        urls=urls,
        config=crawler_config,
        max_concurrent=5  # Control concurrency
    )
    for result in results:
        if result.success:
            print(f"✅ {result.url}: {len(result.markdown)} chars")
```
### 3. Session & Authentication
Handle login-required content:
```python
# First crawl - establish session and log in
login_config = CrawlerRunConfig(
    session_id="user_session",
    js_code="""
        document.querySelector('#username').value = 'myuser';
        document.querySelector('#password').value = 'mypass';
        document.querySelector('#submit').click();
    """,
    wait_for="css:.dashboard"  # Wait for post-login element
)
await crawler.arun("https://site.com/login", config=login_config)

# Subsequent crawls - reuse session
config = CrawlerRunConfig(session_id="user_session")
await crawler.arun("https://site.com/protected-content", config=config)
```
### 4. Dynamic Content Handling
For JavaScript-heavy sites:
```python
config = CrawlerRunConfig(
    # Wait for dynamic content
    wait_for="css:.ajax-content",
    # Execute JavaScript
    js_code="""
        // Scroll to load content
        window.scrollTo(0, document.body.scrollHeight);
        // Click the load-more button if present
        document.querySelector('.load-more')?.click();
    """,
    # Note: For virtual scrolling (Twitter/Instagram-style),
    # use the virtual_scroll_config parameter (see docs)
    # Extended timeout for slow-loading pages
    page_timeout=60000
)
```
### 5. Anti-Detection & Proxies
Avoid bot detection:
```python
# Proxy configuration
browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.server:8080",
        "username": "user",
        "password": "pass"
    }
)

# For stealth/undetected browsing, consider:
# - Rotating user agents via the user_agent parameter
# - Using different viewport sizes
# - Adding delays between requests

# Rate limiting
import asyncio

for url in urls:
    result = await crawler.arun(url)
    await asyncio.sleep(2)  # Delay between requests
```
## Common Use Cases
### Documentation to Markdown
```python
# Convert an entire documentation site to clean markdown
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://docs.example.com")
    # Save as markdown for LLM consumption
    with open("docs.md", "w") as f:
        f.write(result.markdown)
```
### E-commerce Product Monitoring
```python
# Generate schema once for product pages,
# then monitor prices/availability without LLM costs
import json
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

with open("product_schema.json") as f:
    schema = json.load(f)

products = await crawler.arun_many(
    product_urls,
    config=CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
)
```
### News Aggregation
```python
# Crawl multiple news sources concurrently
news_urls = ["https://news1.com", "https://news2.com", "https://news3.com"]
results = await crawler.arun_many(news_urls, max_concurrent=5)

# Extract articles with Fit Markdown
for result in results:
    if result.success:
        # Filtered content (requires a content filter; see Fit Markdown above)
        article = result.markdown.fit_markdown
```
### Research & Data Collection
```python
# Academic paper collection with query-focused filtering
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="machine learning transformers")
)
config = CrawlerRunConfig(markdown_generator=md_generator)
# Filtered output is then available as result.markdown.fit_markdown
```
## Resources
### scripts/
- **extraction_pipeline.py** - Three extraction approaches with schema generation
- **basic_crawler.py** - Simple markdown extraction with screenshots
- **batch_crawler.py** - Multi-URL concurrent processing
### references/
- **complete-sdk-reference.md** - Complete SDK documentation (23K words) with all parameters, methods, and advanced features
### Example Code Repository
The Crawl4AI repository includes extensive examples in `docs/examples/`:
#### Core Examples
- **quickstart.py** - Comprehensive starter with all basic patterns:
- Simple crawling, JavaScript execution, CSS selectors
- Content filtering, link analysis, media handling
- LLM extraction, CSS extraction, dynamic content
- Browser comparison, SSL certificates
#### Specialized Examples
- **amazon_product_extraction_*.py** - Three approaches for e-commerce scraping
- **extraction_strategies_examples.py** - All extraction strategies demonstrated
- **deepcrawl_example.py** - Advanced deep crawling patterns
- **crypto_analysis_example.py** - Complex data extraction with analysis
- **parallel_execution_example.py** - High-performance concurrent crawling
- **session_management_example.py** - Authentication and session handling
- **markdown_generation_example.py** - Advanced markdown customization
- **hooks_example.py** - Custom hooks for crawl lifecycle events
- **proxy_rotation_example.py** - Proxy management and rotation
- **router_example.py** - Request routing and URL patterns
#### Advanced Patterns
- **adaptive_crawling/** - Intelligent crawling strategies
- **c4a_script/** - C4A script examples
- **docker_*.py** - Docker deployment patterns
To explore examples:
```python
# The examples are located in your Crawl4AI installation:
# Look in: docs/examples/ directory
# Start with quickstart.py for comprehensive patterns
# It includes: simple crawl, JS execution, CSS selectors,
# content filtering, LLM extraction, dynamic pages, and more
# For specific use cases:
# - E-commerce: amazon_product_extraction_*.py
# - High performance: parallel_execution_example.py
# - Authentication: session_management_example.py
# - Deep crawling: deepcrawl_example.py
# Run any example directly:
# python docs/examples/quickstart.py
```
## Best Practices
1. **Start with basic crawling** - Understand BrowserConfig, CrawlerRunConfig, and arun() before moving to advanced features
2. **Use markdown generation** for documentation and content - Crawl4AI excels at clean markdown extraction
3. **Try schema generation first** for structured data - 10-100x more efficient than LLM extraction
4. **Enable caching during development** - `cache_mode=CacheMode.ENABLED` avoids repeated requests (see the sketch after this list)
5. **Set appropriate timeouts** - 30s for normal sites, 60s+ for JavaScript-heavy sites
6. **Respect rate limits** - Use delays and `max_concurrent` parameter
7. **Reuse sessions** for authenticated content instead of re-logging
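A minimal sketch of practice 4, toggling cache behavior between development and production runs:

```python
from crawl4ai import CrawlerRunConfig, CacheMode

# Development: read/write the local cache to avoid re-fetching pages
dev_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)

# Production: bypass the cache so every crawl sees fresh content
prod_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```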
## Troubleshooting
**JavaScript not loading:**
```python
config = CrawlerRunConfig(
    wait_for="css:.dynamic-content",  # Wait for a specific element
    page_timeout=60000                # Increase timeout
)
```
**Bot detection issues:**
```python
import asyncio
import random

browser_config = BrowserConfig(
    headless=False,  # Sometimes visible browsing helps
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)

# Add delays between requests
await asyncio.sleep(random.uniform(2, 5))
```
**Content extraction problems:**
```python
# Debug what's being extracted
result = await crawler.arun(url)
print(f"HTML length: {len(result.html)}")
print(f"Markdown length: {len(result.markdown)}")
print(f"Links found: {len(result.links)}")

# Try different wait strategies
config = CrawlerRunConfig(
    wait_for="js:document.querySelector('.content') !== null"
)
```
**Session/auth issues:**
```python
# Verify session is maintained
config = CrawlerRunConfig(session_id="test_session")
result = await crawler.arun(url, config=config)
print(f"Session ID: {result.session_id}")
print(f"Cookies: {result.cookies}")
```
For more details on any topic, refer to `references/complete-sdk-reference.md` which contains comprehensive documentation of all features, parameters, and advanced usage patterns.
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/basic_crawler.py
```python
#!/usr/bin/env python3
"""
Basic Crawl4AI crawler template
Usage: python basic_crawler.py <url>
"""
import asyncio
import sys

# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_basic(url: str):
    """Basic crawling with markdown output"""
    # Configure browser
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )
    # Configure crawler
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        remove_overlay_elements=True,
        wait_for_images=True,
        screenshot=True
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url=url,
            config=crawler_config
        )
        if result.success:
            print(f"✅ Crawled: {result.url}")
            print(f"   Title: {result.metadata.get('title', 'N/A')}")
            print(f"   Links found: {len(result.links.get('internal', []))} internal, {len(result.links.get('external', []))} external")
            print(f"   Media found: {len(result.media.get('images', []))} images, {len(result.media.get('videos', []))} videos")
            print(f"   Content length: {len(result.markdown)} chars")
            # Save markdown
            with open("output.md", "w") as f:
                f.write(result.markdown)
            print("📄 Saved to output.md")
            # Save screenshot if available
            if result.screenshot:
                # Screenshot may be a base64 string or raw bytes
                if isinstance(result.screenshot, str):
                    import base64
                    screenshot_data = base64.b64decode(result.screenshot)
                else:
                    screenshot_data = result.screenshot
                with open("screenshot.png", "wb") as f:
                    f.write(screenshot_data)
                print("📸 Saved screenshot.png")
        else:
            print(f"❌ Failed: {result.error_message}")
        return result

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python basic_crawler.py <url>")
        sys.exit(1)
    url = sys.argv[1]
    asyncio.run(crawl_basic(url))
```
### scripts/batch_crawler.py
```python
#!/usr/bin/env python3
"""
Crawl4AI batch/multi-URL crawler with concurrent processing
Usage: python batch_crawler.py urls.txt [--max-concurrent 5]
"""
import asyncio
import sys
import json
from pathlib import Path
from typing import List, Dict, Any

# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_batch(urls: List[str], max_concurrent: int = 5):
    """Crawl multiple URLs efficiently with concurrent processing"""
    print(f"🚀 Starting batch crawl of {len(urls)} URLs (max {max_concurrent} concurrent)")
    # Configure browser for efficiency
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1280,
        viewport_height=800,
        verbose=False
    )
    # Configure crawler
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        remove_overlay_elements=True,
        wait_for="css:body",
        page_timeout=30000,  # 30 seconds timeout per page
        screenshot=False     # Disable screenshots for batch processing
    )
    results = []
    failed = []
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Use arun_many for efficient batch processing
        batch_results = await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            max_concurrent=max_concurrent
        )
        for result in batch_results:
            if result.success:
                results.append({
                    "url": result.url,
                    "title": result.metadata.get("title", ""),
                    "description": result.metadata.get("description", ""),
                    "content_length": len(result.markdown),
                    "links_count": len(result.links.get("internal", [])) + len(result.links.get("external", [])),
                    "images_count": len(result.media.get("images", [])),
                })
                print(f"✅ {result.url}")
            else:
                failed.append({
                    "url": result.url,
                    "error": result.error_message
                })
                print(f"❌ {result.url}: {result.error_message}")
    # Save results
    output = {
        "success_count": len(results),
        "failed_count": len(failed),
        "results": results,
        "failed": failed
    }
    with open("batch_results.json", "w") as f:
        json.dump(output, f, indent=2)
    # Save individual markdown files
    markdown_dir = Path("batch_markdown")
    markdown_dir.mkdir(exist_ok=True)
    for i, result in enumerate(batch_results):
        if result.success:
            # Create safe filename from URL
            safe_name = result.url.replace("https://", "").replace("http://", "")
            safe_name = "".join(c if c.isalnum() or c in "-_" else "_" for c in safe_name)[:100]
            file_path = markdown_dir / f"{i:03d}_{safe_name}.md"
            with open(file_path, "w") as f:
                f.write(f"# {result.metadata.get('title', result.url)}\n\n")
                f.write(f"URL: {result.url}\n\n")
                f.write(result.markdown)
    print(f"\n📊 Batch Crawl Complete:")
    print(f"   ✅ Success: {len(results)}")
    print(f"   ❌ Failed: {len(failed)}")
    print(f"   💾 Results saved to: batch_results.json")
    print(f"   📁 Markdown files saved to: {markdown_dir}/")
    return output

async def crawl_with_extraction(urls: List[str], schema_file: str = None):
    """Batch crawl with structured data extraction"""
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
    schema = None
    if schema_file and Path(schema_file).exists():
        with open(schema_file) as f:
            schema = json.load(f)
        print(f"📋 Using extraction schema from: {schema_file}")
    else:
        # Default schema for general content
        schema = {
            "name": "content",
            "selector": "body",
            "fields": [
                {"name": "headings", "selector": "h1, h2, h3", "type": "text", "all": True},
                {"name": "paragraphs", "selector": "p", "type": "text", "all": True},
                {"name": "links", "selector": "a[href]", "type": "attribute", "attribute": "href", "all": True}
            ]
        }
    extraction_strategy = JsonCssExtractionStrategy(schema=schema)
    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        cache_mode=CacheMode.BYPASS
    )
    extracted_data = []
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=crawler_config,
            max_concurrent=5
        )
        for result in results:
            if result.success and result.extracted_content:
                try:
                    data = json.loads(result.extracted_content)
                    extracted_data.append({
                        "url": result.url,
                        "data": data
                    })
                    print(f"✅ Extracted from: {result.url}")
                except json.JSONDecodeError:
                    print(f"⚠️ Failed to parse JSON from: {result.url}")
    # Save extracted data
    with open("batch_extracted.json", "w") as f:
        json.dump(extracted_data, f, indent=2)
    print(f"\n💾 Extracted data saved to: batch_extracted.json")
    return extracted_data

def load_urls(source: str) -> List[str]:
    """Load URLs from file or string"""
    if Path(source).exists():
        with open(source) as f:
            urls = [line.strip() for line in f if line.strip() and not line.startswith("#")]
    else:
        # Treat as comma-separated URLs
        urls = [url.strip() for url in source.split(",") if url.strip()]
    return urls

async def main():
    if len(sys.argv) < 2:
        print("""
Crawl4AI Batch Crawler

Usage:
  # Crawl URLs from file
  python batch_crawler.py urls.txt [--max-concurrent 5]

  # Crawl with extraction
  python batch_crawler.py urls.txt --extract [schema.json]

  # Crawl comma-separated URLs
  python batch_crawler.py "https://example.com,https://example.org"

Options:
  --max-concurrent N   Max concurrent crawls (default: 5)
  --extract [schema]   Extract structured data using schema

Example urls.txt:
  https://example.com
  https://example.org
  # Comments are ignored
  https://another-site.com
""")
        sys.exit(1)
    source = sys.argv[1]
    urls = load_urls(source)
    if not urls:
        print("❌ No URLs found")
        sys.exit(1)
    print(f"📋 Loaded {len(urls)} URLs")
    # Parse options
    max_concurrent = 5
    extract_mode = False
    schema_file = None
    for i, arg in enumerate(sys.argv[2:], 2):
        if arg == "--max-concurrent" and i + 1 < len(sys.argv):
            max_concurrent = int(sys.argv[i + 1])
        elif arg == "--extract":
            extract_mode = True
            if i + 1 < len(sys.argv) and not sys.argv[i + 1].startswith("--"):
                schema_file = sys.argv[i + 1]
    if extract_mode:
        await crawl_with_extraction(urls, schema_file)
    else:
        await crawl_batch(urls, max_concurrent)

if __name__ == "__main__":
    asyncio.run(main())
```
### scripts/extraction_pipeline.py
```python
#!/usr/bin/env python3
"""
Crawl4AI extraction pipeline - Three approaches:
1. Generate schema with LLM (one-time) then use CSS extraction (most efficient)
2. Manual CSS/JSON schema extraction
3. Direct LLM extraction (for complex/irregular content)

Usage examples:
  Generate schema:      python extraction_pipeline.py --generate-schema <url> "<instruction>"
  Use generated schema: python extraction_pipeline.py --use-schema <url> schema.json
  Manual schema:        python extraction_pipeline.py --manual <url>
  Direct LLM:           python extraction_pipeline.py --llm <url> "<instruction>"
"""
import asyncio
import sys
import json
from pathlib import Path

# Version check
MIN_CRAWL4AI_VERSION = "0.7.4"
try:
    from crawl4ai.__version__ import __version__
    from packaging import version
    if version.parse(__version__) < version.parse(MIN_CRAWL4AI_VERSION):
        print(f"⚠️ Warning: Crawl4AI {MIN_CRAWL4AI_VERSION}+ recommended (you have {__version__})")
except ImportError:
    print(f"ℹ️ Crawl4AI {MIN_CRAWL4AI_VERSION}+ required")

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import (
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
    CosineStrategy
)

# =============================================================================
# APPROACH 1: Generate Schema (Most Efficient for Repetitive Patterns)
# =============================================================================

async def generate_schema(url: str, instruction: str, output_file: str = "generated_schema.json"):
    """
    Step 1: Generate a reusable schema using LLM (one-time cost)
    Best for: E-commerce sites, blogs, news sites with repetitive patterns
    """
    print("🔍 Generating extraction schema using LLM...")
    browser_config = BrowserConfig(headless=True)
    # Use LLM to analyze the page structure and generate schema
    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # Can use any LLM provider
        instruction=f"""
        Analyze this webpage and generate a CSS/JSON extraction schema.
        Task: {instruction}
        Return a JSON schema with CSS selectors that can extract the required data.
        Format:
        {{
            "name": "items",
            "selector": "main_container_selector",
            "fields": [
                {{"name": "field1", "selector": "css_selector", "type": "text"}},
                {{"name": "field2", "selector": "css_selector", "type": "link"}},
                // more fields...
            ]
        }}
        Make selectors as specific as possible to avoid false matches.
        """
    )
    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        wait_for="css:body",
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawler_config)
        if result.success and result.extracted_content:
            try:
                # Parse and save the generated schema
                schema = json.loads(result.extracted_content)
                # Validate and enhance schema
                if "name" not in schema:
                    schema["name"] = "items"
                if "fields" not in schema:
                    print("⚠️ Generated schema missing fields, using fallback")
                    schema = {
                        "name": "items",
                        "baseSelector": "div.item, article, .product",
                        "fields": [
                            {"name": "title", "selector": "h1, h2, h3", "type": "text"},
                            {"name": "description", "selector": "p", "type": "text"},
                            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
                        ]
                    }
                # Save schema
                with open(output_file, "w") as f:
                    json.dump(schema, f, indent=2)
                print(f"✅ Schema generated and saved to: {output_file}")
                print(f"📋 Schema structure:")
                print(json.dumps(schema, indent=2))
                return schema
            except json.JSONDecodeError as e:
                print(f"❌ Failed to parse generated schema: {e}")
                print("Raw output:", result.extracted_content[:500])
                return None
        else:
            print(f"❌ Failed to generate schema: {result.error_message if result else 'Unknown error'}")
            return None

async def use_generated_schema(url: str, schema_file: str):
    """
    Step 2: Use the generated schema for fast, repeated extractions
    No LLM calls needed - pure CSS extraction
    """
    print(f"📂 Loading schema from: {schema_file}")
    try:
        with open(schema_file, "r") as f:
            schema = json.load(f)
    except FileNotFoundError:
        print(f"❌ Schema file not found: {schema_file}")
        print("💡 Generate a schema first using: python extraction_pipeline.py --generate-schema <url> \"<instruction>\"")
        return None
    print("🚀 Extracting data using generated schema (no LLM calls)...")
    extraction_strategy = JsonCssExtractionStrategy(
        schema=schema,
        verbose=True
    )
    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        wait_for="css:body"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=crawler_config)
        if result.success and result.extracted_content:
            data = json.loads(result.extracted_content)
            items = data.get(schema.get("name", "items"), [])
            print(f"✅ Extracted {len(items)} items using schema")
            # Save results
            with open("extracted_data.json", "w") as f:
                json.dump(data, f, indent=2)
            print("💾 Saved to extracted_data.json")
            # Show sample
            if items:
                print("\n📋 Sample (first item):")
                print(json.dumps(items[0], indent=2))
            return data
        else:
            print(f"❌ Extraction failed: {result.error_message if result else 'Unknown error'}")
            return None

# =============================================================================
# APPROACH 2: Manual Schema Definition
# =============================================================================

async def extract_with_manual_schema(url: str, schema: dict = None):
    """
    Use a manually defined CSS/JSON schema
    Best for: When you know the exact structure of the website
    """
    if not schema:
        # Example schema for general content extraction
        schema = {
            "name": "content",
            "baseSelector": "body",  # Changed from 'selector' to 'baseSelector'
            "fields": [
                {"name": "title", "selector": "h1", "type": "text"},
                {"name": "paragraphs", "selector": "p", "type": "text", "all": True},
                {"name": "links", "selector": "a", "type": "attribute", "attribute": "href", "all": True}
            ]
        }
    print("📐 Using manual CSS/JSON schema for extraction...")
    extraction_strategy = JsonCssExtractionStrategy(
        schema=schema,
        verbose=True
    )
    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=crawler_config)
        if result.success and result.extracted_content:
            data = json.loads(result.extracted_content)
            # Handle both list and dict formats
            if isinstance(data, list):
                items = data
            else:
                items = data.get(schema["name"], [])
            print(f"✅ Extracted {len(items)} items using manual schema")
            with open("manual_extracted.json", "w") as f:
                json.dump(data, f, indent=2)
            print("💾 Saved to manual_extracted.json")
            return data
        else:
            print(f"❌ Extraction failed")
            return None

# =============================================================================
# APPROACH 3: Direct LLM Extraction
# =============================================================================

async def extract_with_llm(url: str, instruction: str):
    """
    Direct LLM extraction - uses LLM for every request
    Best for: Complex, irregular content or one-time extractions
    Note: Most expensive approach, use sparingly
    """
    print("🤖 Using direct LLM extraction...")
    browser_config = BrowserConfig(headless=True)
    extraction_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",  # Can change to ollama/llama3, anthropic/claude, etc.
        instruction=instruction,
        schema={
            "type": "object",
            "properties": {
                "items": {
                    "type": "array",
                    "items": {"type": "object"}
                },
                "summary": {"type": "string"}
            }
        }
    )
    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        wait_for="css:body",
        remove_overlay_elements=True
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url=url, config=crawler_config)
        if result.success and result.extracted_content:
            try:
                data = json.loads(result.extracted_content)
                items = data.get('items', [])
                print(f"✅ LLM extracted {len(items)} items")
                print(f"📝 Summary: {data.get('summary', 'N/A')}")
                with open("llm_extracted.json", "w") as f:
                    json.dump(data, f, indent=2)
                print("💾 Saved to llm_extracted.json")
                if items:
                    print("\n📋 Sample (first item):")
                    print(json.dumps(items[0], indent=2))
                return data
            except json.JSONDecodeError:
                print("⚠️ Could not parse LLM output as JSON")
                print(result.extracted_content[:500])
                return None
        else:
            print(f"❌ LLM extraction failed")
            return None

# =============================================================================
# Main CLI Interface
# =============================================================================

async def main():
    if len(sys.argv) < 3:
        print("""
Crawl4AI Extraction Pipeline - Three Approaches

1️⃣ GENERATE & USE SCHEMA (Most Efficient for Repetitive Patterns):
   Step 1: Generate schema (one-time LLM cost)
     python extraction_pipeline.py --generate-schema <url> "<what to extract>"
   Step 2: Use schema for fast extraction (no LLM)
     python extraction_pipeline.py --use-schema <url> generated_schema.json

2️⃣ MANUAL SCHEMA (When You Know the Structure):
   python extraction_pipeline.py --manual <url>
   (Edit the schema in the script for your needs)

3️⃣ DIRECT LLM (For Complex/Irregular Content):
   python extraction_pipeline.py --llm <url> "<extraction instruction>"

Examples:
  # E-commerce products
  python extraction_pipeline.py --generate-schema https://shop.com "Extract all products with name, price, image"
  python extraction_pipeline.py --use-schema https://shop.com generated_schema.json

  # News articles
  python extraction_pipeline.py --generate-schema https://news.com "Extract headlines, dates, and summaries"

  # Complex content
  python extraction_pipeline.py --llm https://complex-site.com "Extract financial data and quarterly reports"
""")
        sys.exit(1)
    mode = sys.argv[1]
    url = sys.argv[2]
    if mode == "--generate-schema":
        if len(sys.argv) < 4:
            print("Error: Missing extraction instruction")
            print("Usage: python extraction_pipeline.py --generate-schema <url> \"<instruction>\"")
            sys.exit(1)
        instruction = sys.argv[3]
        output_file = sys.argv[4] if len(sys.argv) > 4 else "generated_schema.json"
        await generate_schema(url, instruction, output_file)
    elif mode == "--use-schema":
        if len(sys.argv) < 4:
            print("Error: Missing schema file")
            print("Usage: python extraction_pipeline.py --use-schema <url> <schema.json>")
            sys.exit(1)
        schema_file = sys.argv[3]
        await use_generated_schema(url, schema_file)
    elif mode == "--manual":
        await extract_with_manual_schema(url)
    elif mode == "--llm":
        if len(sys.argv) < 4:
            print("Error: Missing extraction instruction")
            print("Usage: python extraction_pipeline.py --llm <url> \"<instruction>\"")
            sys.exit(1)
        instruction = sys.argv[3]
        await extract_with_llm(url, instruction)
    else:
        print(f"Unknown mode: {mode}")
        print("Use --generate-schema, --use-schema, --manual, or --llm")
        sys.exit(1)

if __name__ == "__main__":
    asyncio.run(main())
```
### references/complete-sdk-reference.md
```markdown
# Crawl4AI Complete SDK Documentation
**Generated:** 2025-10-19 12:56
**Format:** Ultra-Dense Reference (Optimized for AI Assistants)
**Crawl4AI Version:** 0.7.4
---
## Navigation
- [Installation & Setup](#installation--setup)
- [Quick Start](#quick-start)
- [Core API](#core-api)
- [Configuration](#configuration)
- [Crawling Patterns](#crawling-patterns)
- [Content Processing](#content-processing)
- [Extraction Strategies](#extraction-strategies)
- [Advanced Features](#advanced-features)
---
## Installation & Setup
## 1. Basic Installation
```bash
pip install crawl4ai
```
## 2. Initial Setup & Diagnostics
### 2.1 Run the Setup Command
```bash
crawl4ai-setup
```
- Performs OS-level checks (e.g., missing libs on Linux)
- Confirms your environment is ready to crawl
### 2.2 Diagnostics
```bash
crawl4ai-doctor
```
- Check Python version compatibility
- Verify Playwright installation
- Inspect environment variables or library conflicts
If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`.
## 3. Verifying Installation: A Simple Crawl (Skip this step if you already ran `crawl4ai-doctor`)
Below is a minimal Python script demonstrating a **basic** crawl. It uses our new **`BrowserConfig`** and **`CrawlerRunConfig`** for clarity, though no custom settings are passed in this example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
        )
        print(result.markdown[:300])  # Show the first 300 characters of extracted text

if __name__ == "__main__":
    asyncio.run(main())
```
- A headless browser session loads `example.com`
- The script prints the first ~300 characters of the generated markdown.
If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly.
## 4. Advanced Installation (Optional)
### 4.1 Torch, Transformers, or All
- **Text Clustering (Torch)**
```bash
pip install crawl4ai[torch]
crawl4ai-setup
```
- **Transformers**
```bash
pip install crawl4ai[transformer]
crawl4ai-setup
```
- **All Features**
```bash
pip install crawl4ai[all]
crawl4ai-setup
```
Optionally, pre-download the model files used by these extras:
```bash
crawl4ai-download-models
```
## 5. Docker (Experimental)
```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```
You can then make POST requests to `http://localhost:11235/crawl` to perform crawls. **Production usage** is discouraged until our new Docker approach is ready (planned in Jan or Feb 2025).
## 6. Local Server Mode (Legacy)
## Summary
1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`.
2. **Diagnose** with `crawl4ai-doctor` if you see errors.
3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`.
## Quick Start
## Getting Started with Crawl4AI
1. Run your **first crawl** using minimal configuration.
2. Experiment with a simple **CSS-based extraction** strategy.
3. Crawl a **dynamic** page that loads content via JavaScript.
## 1. Introduction
- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
## 2. Your First Crawl
Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```
- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.
## 3. Basic Configuration (Light Introduction)
1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
> IMPORTANT: By default, cache mode is set to `CacheMode.BYPASS` so each crawl fetches fresh content. Set `CacheMode.ENABLED` to turn caching on.
## 4. Generating Markdown Output
- **`result.markdown`**:
- **`result.markdown.fit_markdown`**:
The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
### Example: Using a Filter with `DefaultMarkdownGenerator`
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://news.ycombinator.com", config=config)
    print("Raw Markdown length:", len(result.markdown.raw_markdown))
    print("Fit Markdown length:", len(result.markdown.fit_markdown))
```
**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
## 5. Simple Data Extraction (CSS-based)
```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires an API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config=LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
```
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```
- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.
## 6. Simple Data Extraction (LLM-based)
- **Open-Source Models** (e.g., `ollama/llama3.3`; no API token needed)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library
```python
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return
    browser_config = BrowserConfig(headless=True)
    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config=LLMConfig(provider=provider, api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
```
- We define a Pydantic schema (`OpenAIModelFee`) describing the fields we want.
## 7. Adaptive Crawling (New!)
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)
        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )
        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```
- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is
## 8. Multi-URL Concurrency (Preview)
If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def quick_parallel_example():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )
    async with AsyncWebCrawler() as crawler:
        # Stream results as they complete
        async for result in await crawler.arun_many(urls, config=run_conf):
            if result.success:
                print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {result.url} => {result.error_message}")

        # Or get all results at once (default behavior)
        run_conf = run_conf.clone(stream=False)
        results = await crawler.arun_many(urls, config=run_conf)
        for res in results:
            if res.success:
                print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}")
            else:
                print(f"[ERROR] {res.url} => {res.error_message}")

if __name__ == "__main__":
    asyncio.run(quick_parallel_example())
```
1. **Streaming mode** (`stream=True`): Process results as they become available using `async for`
2. **Batch mode** (`stream=False`): Wait for all results to complete
## 9. Dynamic Content Example
Some sites require multiple clicks or dynamic JavaScript updates before their content is visible. Below is an example that clicks through a page’s tab menu and extracts the revealed panels with a CSS strategy, using **`BrowserConfig`** and **`CrawlerRunConfig`**:
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "section_description", "selector": ".charge-content", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_description", "selector": ".course-content-text", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"},
        ],
    }
    browser_config = BrowserConfig(headless=True, java_script_enabled=True)
    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )
        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} items")
        print(json.dumps(companies[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
```
- **`BrowserConfig(headless=True, java_script_enabled=True)`**: JavaScript must be enabled so the injected tab-clicking script can run.
- **`CrawlerRunConfig(...)`**: We specify the CSS extraction strategy and pass the click script via `js_code`.
- The injected script clicks each tab in turn, waiting 500 ms after every click so the revealed panel can render before extraction.
- For true multi-page flows (e.g., clicking “Next Page” through GitHub commits), combine `session_id` to reuse the same page, `js_only=True` to continue the existing session without re-navigating, and `kill_session()` to clean up the page and browser session afterwards.
## 10. Next Steps
1. Performed a basic crawl and printed Markdown.
2. Used **content filters** with a markdown generator.
3. Extracted JSON via **CSS** or **LLM** strategies.
4. Handled **dynamic** pages with JavaScript triggers.
## Core API
## AsyncWebCrawler
The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
## 1. Constructor Overview
```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,            # deprecated
        always_by_pass_cache: Optional[bool] = None,  # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy:
                (Advanced) Provide a custom crawler strategy if needed.
            config:
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache:
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:
                Folder for storing caches/logs (if relevant).
            thread_safe:
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs:
                Additional legacy or debugging parameters.
        """
```

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)
crawler = AsyncWebCrawler(config=browser_cfg)
```
**Notes**:
- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
---
## 2. Lifecycle: Start/Close or Context Manager
### 2.1 Context Manager (Recommended)
```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```
When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
### 2.2 Manual Start & Close
```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```
Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.
---
## 3. Primary Method: `arun()`
```python
async def arun(
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
):
```
### 3.1 New Approach
You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
```
### 3.2 Legacy Parameters Still Accepted
For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
---
## 4. Batch Processing: `arun_many()`
```python
async def arun_many(
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
):
```
### 4.1 Resource-Aware Crawling
The `arun_many()` method now uses an intelligent dispatcher that:
- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently
### 4.2 Example Usage
Check page [Multi-url Crawling](../advanced/multi-url-crawling.md) for a detailed example of how to use `arun_many()`.
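Until then, a minimal batch call mirrors the parallel example from the Quick Start:

```python
results = await crawler.arun_many(
    ["https://example.com/a", "https://example.com/b"],
    config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
for result in results:
    print(result.url, result.success)
```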
## 4.3 Key Features
1. **Rate Limiting**
- Automatic delay between requests
- Exponential backoff on rate limit detection
- Domain-specific rate limiting
- Configurable retry strategy
2. **Resource Monitoring**
- Memory usage tracking
- Adaptive concurrency based on system load
- Automatic pausing when resources are constrained
3. **Progress Monitoring**
- Detailed or aggregated progress display
- Real-time status updates
- Memory usage statistics
4. **Error Handling**
- Graceful handling of rate limits
- Automatic retries with backoff
- Detailed error reporting
## 5. `CrawlResult` Output
Each `arun()` returns a **`CrawlResult`** containing:
- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated. Instead just use regular `markdown`
- `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
- `screenshot`, `pdf`: If screenshots/PDF requested.
- `media`, `links`: Information about discovered images/links.
- `success`, `error_message`: Status info.
## 6. Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
import json

async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
    )
    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )
        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)

asyncio.run(main())
```
- We define a **`BrowserConfig`** with Firefox, `headless=False`, and `verbose=True`.
- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.
- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
## 7. Best Practices & Migration Notes
1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.
2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).
3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
```python
run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
result = await crawler.arun(url="...", config=run_cfg)
```
## 8. Summary
- **Constructor** accepts **`BrowserConfig`** (or defaults).
- **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.
- **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.
- For advanced lifecycle control, use `start()` and `close()` explicitly (see the sketch below).
- If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
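A sketch of that explicit lifecycle (the URL and configs are illustrative):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    crawler = AsyncWebCrawler(config=BrowserConfig(headless=True))
    await crawler.start()  # explicit startup instead of `async with`
    try:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig()
        )
        print(result.success)
    finally:
        await crawler.close()  # always release the browser

asyncio.run(main())
```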
## `arun()` Parameter Guide (New Approach)
In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
```python
await crawler.arun(
url="https://example.com",
config=my_run_config
)
```
Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
## 1. Core Usage
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def main():
run_config = CrawlerRunConfig(
verbose=True, # Detailed logging
cache_mode=CacheMode.ENABLED, # Use normal read/write cache
check_robots_txt=True, # Respect robots.txt rules
# ... other parameters
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
# Check if blocked by robots.txt
if not result.success and result.status_code == 403:
print(f"Error: {result.error_message}")
```
- `verbose=True` logs each crawl step.
- `cache_mode` decides how to read/write the local crawl cache.
## 2. Cache Control
**`cache_mode`** (default: `CacheMode.ENABLED`)
Use a built-in enum from `CacheMode`:
- `ENABLED`: Normal caching—reads if available, writes if missing.
- `DISABLED`: No caching—always refetch pages.
- `READ_ONLY`: Reads from cache only; no new writes.
- `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
- `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
```python
run_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS
)
```
For backward compatibility, these boolean shorthands are still accepted in place of `cache_mode`:
- `bypass_cache=True` acts like `CacheMode.BYPASS`.
- `disable_cache=True` acts like `CacheMode.DISABLED`.
- `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
- `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
## 3. Content Processing & Selection
### 3.1 Text Processing
```python
run_config = CrawlerRunConfig(
word_count_threshold=10, # Ignore text blocks <10 words
only_text=False, # If True, tries to remove non-text elements
keep_data_attributes=False # Keep or discard data-* attributes
)
```
### 3.2 Content Selection
```python
run_config = CrawlerRunConfig(
css_selector=".main-content", # Focus on .main-content region only
excluded_tags=["form", "nav"], # Remove entire tag blocks
remove_forms=True, # Specifically strip <form> elements
remove_overlay_elements=True, # Attempt to remove modals/popups
)
```
### 3.3 Link Handling
```python
run_config = CrawlerRunConfig(
exclude_external_links=True, # Remove external links from final content
exclude_social_media_links=True, # Remove links to known social sites
exclude_domains=["ads.example.com"], # Exclude links to these domains
exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
)
```
### 3.4 Media Filtering
```python
run_config = CrawlerRunConfig(
exclude_external_images=True # Strip images from other domains
)
```
## 4. Page Navigation & Timing
### 4.1 Basic Browser Flow
```python
run_config = CrawlerRunConfig(
wait_for="css:.dynamic-content", # Wait for .dynamic-content
delay_before_return_html=2.0, # Wait 2s before capturing final HTML
page_timeout=60000, # Navigation & script timeout (ms)
)
```
- `wait_for` accepts either:
  - `"css:selector"`, or
  - `"js:() => boolean"`, e.g. `js:() => document.querySelectorAll('.item').length > 10`.
- `mean_delay` & `max_range`: define random delays for `arun_many()` calls.
- `semaphore_count`: concurrency limit when crawling multiple URLs.
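For example, pacing multiple crawls might look like this (a sketch; the values are illustrative, parameters as listed above):
```python
from crawl4ai import CrawlerRunConfig

run_config = CrawlerRunConfig(
    mean_delay=0.5,     # average random delay between arun_many() requests (s)
    max_range=1.0,      # extra random delay added on top, up to 1s
    semaphore_count=3   # crawl at most 3 URLs concurrently
)
```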
### 4.2 JavaScript Execution
```python
run_config = CrawlerRunConfig(
js_code=[
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more')?.click();"
],
js_only=False
)
```
- `js_code` can be a single string or a list of strings.
- `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”
### 4.3 Anti-Bot
```python
run_config = CrawlerRunConfig(
magic=True,
simulate_user=True,
override_navigator=True
)
```
- `magic=True` tries multiple stealth features.
- `simulate_user=True` mimics mouse movements or random delays.
- `override_navigator=True` fakes some navigator properties (like user agent checks).
## 5. Session Management
**`session_id`**:
```python
run_config = CrawlerRunConfig(
session_id="my_session123"
)
```
If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
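A minimal two-step sketch (the URL and JS are illustrative; `clone()` is described later in this document):
```python
from crawl4ai import CrawlerRunConfig

# Step 1: open the page in a named session
session_cfg = CrawlerRunConfig(session_id="my_session123")
result1 = await crawler.arun(url="https://example.com/page1", config=session_cfg)

# Step 2: reuse the same tab, run JS only, no full re-navigation
next_cfg = session_cfg.clone(
    js_code="document.querySelector('.load-more')?.click();",
    js_only=True
)
result2 = await crawler.arun(url="https://example.com/page1", config=next_cfg)
```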
## 6. Screenshot, PDF & Media Options
```python
run_config = CrawlerRunConfig(
screenshot=True, # Grab a screenshot as base64
screenshot_wait_for=1.0, # Wait 1s before capturing
pdf=True, # Also produce a PDF
image_description_min_word_threshold=5, # If analyzing alt text
image_score_threshold=3, # Filter out low-score images
)
```
- `result.screenshot` → Base64 screenshot string.
- `result.pdf` → Byte array with PDF data.
## 7. Extraction Strategy
**For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
```python
run_config = CrawlerRunConfig(
extraction_strategy=my_css_or_llm_strategy
)
```
The extracted data will appear in `result.extracted_content`.
## 8. Comprehensive Example
Below is a snippet combining many parameters:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
async def main():
# Example schema
schema = {
"name": "Articles",
"baseSelector": "article.post",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
run_config = CrawlerRunConfig(
# Core
verbose=True,
cache_mode=CacheMode.ENABLED,
check_robots_txt=True, # Respect robots.txt rules
# Content
word_count_threshold=10,
css_selector="main.content",
excluded_tags=["nav", "footer"],
exclude_external_links=True,
# Page & JS
js_code="document.querySelector('.show-more')?.click();",
wait_for="css:.loaded-block",
page_timeout=30000,
# Extraction
extraction_strategy=JsonCssExtractionStrategy(schema),
# Session
session_id="persistent_session",
# Media
screenshot=True,
pdf=True,
# Anti-bot
simulate_user=True,
magic=True,
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/posts", config=run_config)
if result.success:
print("HTML length:", len(result.cleaned_html))
print("Extraction JSON:", result.extracted_content)
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
This example demonstrates:
1. **Crawling** the main content region, ignoring external links.
2. Running **JavaScript** to click “.show-more”.
3. **Waiting** for “.loaded-block” to appear.
4. Generating a **screenshot** & **PDF** of the final page.
## 9. Best Practices
1. **Use `BrowserConfig` for global browser** settings (headless, user agent).
2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.
3. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.
4. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
## 10. Conclusion
All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**, which makes the code **clearer** and **more maintainable**.
## `arun_many(...)` Reference
> **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
## Function Signature
```python
async def arun_many(
urls: Union[List[str], List[Any]],
config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
dispatcher: Optional[BaseDispatcher] = None,
...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
"""
Crawl multiple URLs concurrently or in batches.
:param urls: A list of URLs (or tasks) to crawl.
:param config: (Optional) Either:
- A single `CrawlerRunConfig` applying to all URLs
- A list of `CrawlerRunConfig` objects with url_matcher patterns
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
...
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
"""
```
## Differences from `arun()`
1. **Multiple URLs**:
- Instead of crawling a single URL, you pass a list of them (strings or tasks).
- The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
- **`dispatcher`** param allows advanced concurrency control.
- If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
3. **Streaming Support**:
- Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
- When streaming, use `async for` to process results as they become available.
4. **Parallel Execution**:
- `arun_many()` can run multiple requests concurrently under the hood.
- Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
### Basic Example (Batch Mode)
```python
# Minimal usage: The default dispatcher will be used
results = await crawler.arun_many(
urls=["https://site1.com", "https://site2.com"],
config=CrawlerRunConfig(stream=False) # Default behavior
)
for res in results:
if res.success:
print(res.url, "crawled OK!")
else:
print("Failed:", res.url, "-", res.error_message)
```
### Streaming Example
```python
config = CrawlerRunConfig(
stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS
)
# Process results as they complete
async for result in await crawler.arun_many(
urls=["https://site1.com", "https://site2.com", "https://site3.com"],
config=config
):
if result.success:
print(f"Just completed: {result.url}")
# Process each result immediately
process_result(result)
```
### With a Custom Dispatcher
```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0,
max_session_permit=10
)
results = await crawler.arun_many(
urls=["https://site1.com", "https://site2.com", "https://site3.com"],
config=my_run_config,
dispatcher=dispatcher
)
```
### URL-Specific Configurations
Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
url_matcher="*.pdf",
scraping_strategy=PDFContentScrapingStrategy()
)
# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48)
)
)
# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
url_matcher=lambda url: 'github.com' in url,
js_code="window.scrollTo(0, 500);"
)
# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
url_matcher=lambda url: 'api' in url or url.endswith('.json'),
# Custom settings for JSON extraction
)
# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher: matches ALL URLs, so it serves as the final fallback (first match wins)
# Pass the list of configs - first match wins!
results = await crawler.arun_many(
urls=[
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # → pdf_config
"https://blog.python.org/", # → blog_config
"https://github.com/microsoft/playwright", # → github_config
"https://httpbin.org/json", # → api_config
"https://example.com/" # → default_config
],
config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```
- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- **Function matchers**: `lambda url: 'api' in url`
- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND`
- **First match wins**: Configs are evaluated in order
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.
- **Important**: Always include a default config (without `url_matcher`) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail.
### Return Value
Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
## Dispatcher Reference
- **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.
- **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.
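For instance, a fixed-limit run might look like this (a sketch; the import path and parameter name are assumptions based on the library's dispatcher module):
```python
from crawl4ai.async_dispatcher import SemaphoreDispatcher  # import path assumed

dispatcher = SemaphoreDispatcher(semaphore_count=5)  # at most 5 concurrent crawls
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=my_run_config,
    dispatcher=dispatcher
)
```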
## Common Pitfalls
- **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
## Conclusion
Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
## `CrawlResult` Reference
The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
**Location**: `crawl4ai/crawler/models.py` (for reference)
```python
class CrawlResult(BaseModel):
url: str
html: str
success: bool
cleaned_html: Optional[str] = None
fit_html: Optional[str] = None # Preprocessed HTML optimized for extraction
media: Dict[str, List[Dict]] = {}
links: Dict[str, List[Dict]] = {}
downloaded_files: Optional[List[str]] = None
screenshot: Optional[str] = None
pdf: Optional[bytes] = None
mhtml: Optional[str] = None
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
extracted_content: Optional[str] = None
metadata: Optional[dict] = None
error_message: Optional[str] = None
session_id: Optional[str] = None
response_headers: Optional[dict] = None
status_code: Optional[int] = None
ssl_certificate: Optional[SSLCertificate] = None
dispatch_result: Optional[DispatchResult] = None
...
```
## 1. Basic Crawl Info
### 1.1 **`url`** *(str)*
```python
print(result.url) # e.g., "https://example.com/"
```
### 1.2 **`success`** *(bool)*
**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.
```python
if not result.success:
print(f"Crawl failed: {result.error_message}")
```
### 1.3 **`status_code`** *(Optional[int])*
```python
if result.status_code == 404:
print("Page not found!")
```
### 1.4 **`error_message`** *(Optional[str])*
**What**: If `success=False`, a textual description of the failure.
```python
if not result.success:
print("Error:", result.error_message)
```
### 1.5 **`session_id`** *(Optional[str])*
```python
# If you used session_id="login_session" in CrawlerRunConfig, see it here:
print("Session:", result.session_id)
```
### 1.6 **`response_headers`** *(Optional[dict])*
```python
if result.response_headers:
print("Server:", result.response_headers.get("Server", "Unknown"))
```
### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*
**What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`,
`subject`, `valid_from`, `valid_until`, etc.
```python
if result.ssl_certificate:
print("Issuer:", result.ssl_certificate.issuer)
```
## 2. Raw / Cleaned Content
### 2.1 **`html`** *(str)*
```python
# Possibly large
print(len(result.html))
```
### 2.2 **`cleaned_html`** *(Optional[str])*
**What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.
```python
print(result.cleaned_html[:500]) # Show a snippet
```
## 3. Markdown Fields
### 3.1 The Markdown Generation Approach
Markdown generation can produce several flavors:
- **Raw** markdown
- **Links as citations** (with a references section)
- **Fit** markdown if a **content filter** is used (like Pruning or BM25)
**`MarkdownGenerationResult`** includes:
- **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.
- **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.
- **`references_markdown`** *(str)*: The reference list or footnotes at the end.
- **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text.
- **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
```python
if result.markdown:
md_res = result.markdown
print("Raw MD:", md_res.raw_markdown[:300])
print("Citations MD:", md_res.markdown_with_citations[:300])
print("References:", md_res.references_markdown)
if md_res.fit_markdown:
print("Pruned text:", md_res.fit_markdown[:300])
```
### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*
**What**: Holds the `MarkdownGenerationResult`.
```python
print(result.markdown.raw_markdown[:200])
print(result.markdown.fit_markdown)
print(result.markdown.fit_html)
```
**Important**: "Fit" content (`fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
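A sketch of enabling fit content with a pruning filter (module paths as used elsewhere in this document; the threshold is illustrative):
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

run_cfg = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)  # prune low-value blocks
    )
)
# After a crawl with this config, result.markdown.fit_markdown
# and result.markdown.fit_html should be populated.
```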
## 4. Media & Links
### 4.1 **`media`** *(Dict[str, List[Dict]])*
**What**: Contains info about discovered images, videos, or audio. Typical keys: `"images"`, `"videos"`, `"audios"`. Each item may include:
- `src` *(str)*: Media URL
- `alt` or `title` *(str)*: Descriptive text
- `score` *(float)*: Relevance score if the crawler's heuristic found it "important"
- `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text
```python
images = result.media.get("images", [])
for img in images:
if img.get("score", 0) > 5:
print("High-value image:", img["src"])
```
### 4.2 **`links`** *(Dict[str, List[Dict]])*
**What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`. Each link entry may include:
- `href` *(str)*: The link target
- `text` *(str)*: Link text
- `title` *(str)*: Title attribute
- `context` *(str)*: Surrounding text snippet
- `domain` *(str)*: If external, the domain
```python
for link in result.links["internal"]:
print(f"Internal link to {link['href']} with text {link['text']}")
```
## 5. Additional Fields
### 5.1 **`extracted_content`** *(Optional[str])*
**What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).
```python
if result.extracted_content:
data = json.loads(result.extracted_content)
print(data)
```
### 5.2 **`downloaded_files`** *(Optional[List[str]])*
**What**: If `accept_downloads=True` in your `BrowserConfig` + `downloads_path`, lists local file paths for downloaded items.
```python
if result.downloaded_files:
for file_path in result.downloaded_files:
print("Downloaded:", file_path)
```
### 5.3 **`screenshot`** *(Optional[str])*
**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.
```python
import base64
if result.screenshot:
with open("page.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
```
### 5.4 **`pdf`** *(Optional[bytes])*
**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.
```python
if result.pdf:
with open("page.pdf", "wb") as f:
f.write(result.pdf)
```
### 5.5 **`mhtml`** *(Optional[str])*
**What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file.
```python
if result.mhtml:
with open("page.mhtml", "w", encoding="utf-8") as f:
f.write(result.mhtml)
```
### 5.6 **`metadata`** *(Optional[dict])*
```python
if result.metadata:
print("Title:", result.metadata.get("title"))
print("Author:", result.metadata.get("author"))
```
## 6. `dispatch_result` (optional)
A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
- **`task_id`**: A unique identifier for the parallel task.
- **`memory_usage`** (float): The memory (in MB) used at the time of completion.
- **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution.
- **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
- **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
```python
# Example usage:
for result in results:
if result.success and result.dispatch_result:
dr = result.dispatch_result
print(f"URL: {result.url}, Task ID: {dr.task_id}")
print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
print(f"Duration: {dr.end_time - dr.start_time}")
```
> **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`.
## 7. Network Requests & Console Messages
When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields:
### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])*
- Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`.
- Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`.
- Response events include `url`, `status`, `status_text`, `headers`, and `request_timing`.
- Failed request events include `url`, `method`, `resource_type`, and `failure_text`.
- All events include a `timestamp` field.
```python
if result.network_requests:
# Count different types of events
requests = [r for r in result.network_requests if r.get("event_type") == "request"]
responses = [r for r in result.network_requests if r.get("event_type") == "response"]
failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]
print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures")
# Analyze API calls
api_calls = [r for r in requests if "api" in r.get("url", "")]
# Identify failed resources
for failure in failures:
print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}")
```
### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])*
- Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.).
- The `text` field contains the actual message text.
- Some messages include `location` information (URL, line, column).
- All messages include a `timestamp` field.
```python
if result.console_messages:
# Count messages by type
message_types = {}
for msg in result.console_messages:
msg_type = msg.get("type", "unknown")
message_types[msg_type] = message_types.get(msg_type, 0) + 1
print(f"Message type counts: {message_types}")
# Display errors (which are usually most important)
for msg in result.console_messages:
if msg.get("type") == "error":
print(f"Error: {msg.get('text')}")
```
## 8. Example: Accessing Everything
```python
async def handle_result(result: CrawlResult):
if not result.success:
print("Crawl error:", result.error_message)
return
# Basic info
print("Crawled URL:", result.url)
print("Status code:", result.status_code)
# HTML
print("Original HTML size:", len(result.html))
print("Cleaned HTML size:", len(result.cleaned_html or ""))
# Markdown output
if result.markdown:
print("Raw Markdown:", result.markdown.raw_markdown[:300])
print("Citations Markdown:", result.markdown.markdown_with_citations[:300])
if result.markdown.fit_markdown:
print("Fit Markdown:", result.markdown.fit_markdown[:200])
# Media & Links
if "images" in result.media:
print("Image count:", len(result.media["images"]))
if "internal" in result.links:
print("Internal link count:", len(result.links["internal"]))
# Extraction strategy result
if result.extracted_content:
print("Structured data:", result.extracted_content)
# Screenshot/PDF/MHTML
if result.screenshot:
print("Screenshot length:", len(result.screenshot))
if result.pdf:
print("PDF bytes length:", len(result.pdf))
if result.mhtml:
print("MHTML length:", len(result.mhtml))
# Network and console capturing
if result.network_requests:
print(f"Network requests captured: {len(result.network_requests)}")
# Analyze request types
req_types = {}
for req in result.network_requests:
if "resource_type" in req:
req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1
print(f"Resource types: {req_types}")
if result.console_messages:
print(f"Console messages captured: {len(result.console_messages)}")
# Count by message type
msg_types = {}
for msg in result.console_messages:
msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1
print(f"Message types: {msg_types}")
```
## 9. Key Points & Future
1. **Deprecated legacy properties of `CrawlResult`**
- `markdown_v2` - Deprecated in v0.5. Just use `markdown`; it now holds the `MarkdownGenerationResult`.
- `fit_markdown` and `fit_html` - Deprecated in v0.5. They are now accessed via the `MarkdownGenerationResult` in `result.markdown`, e.g. `result.markdown.fit_markdown` and `result.markdown.fit_html`.
2. **Fit Content**
- **`fit_markdown`** and **`fit_html`** appear in MarkdownGenerationResult, only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.
- If no filter is used, they remain `None`.
3. **References & Citations**
- If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing (see the sketch after this list).
4. **Links & Media**
- `links["internal"]` and `links["external"]` group discovered anchors by domain.
- `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
5. **Error Cases**
- If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).
- `status_code` might be `None` if we failed before an HTTP response.
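As referenced in point 3 above, a minimal citation setup might look like this (a sketch; the `citations` option as documented above):
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

run_cfg = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(options={"citations": True})
)
# result.markdown.markdown_with_citations and
# result.markdown.references_markdown then become available.
```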
Use **`CrawlResult`** to collect all final outputs and feed them into your data pipelines, AI models, or archives. With a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler delivers robust, structured results here in **`CrawlResult`**.
## Configuration
## Browser, Crawler & LLM Configuration (Quick Overview)
Crawl4AI's flexibility stems from three key classes:
1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent).
2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.).
3. **`LLMConfig`** – Dictates **how** LLM providers are configured (model, API token, base URL, temperature, etc.).
In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md).
## 1. BrowserConfig Essentials
```python
class BrowserConfig:
def __init__(
browser_type="chromium",
headless=True,
proxy_config=None,
viewport_width=1080,
viewport_height=600,
verbose=True,
use_persistent_context=False,
user_data_dir=None,
cookies=None,
headers=None,
user_agent=None,
text_mode=False,
light_mode=False,
extra_args=None,
enable_stealth=False,
# ... other advanced parameters omitted here
):
...
```
### Key Fields to Note
1. **`browser_type`**
- Options: `"chromium"`, `"firefox"`, or `"webkit"`.
- Defaults to `"chromium"`.
- If you need a different engine, specify it here.
2. **`headless`**
- `True`: Runs the browser in headless mode (invisible browser).
- `False`: Runs the browser in visible mode, which helps with debugging.
3. **`proxy_config`**
- A dictionary with fields like:
```json
{
"server": "http://proxy.example.com:8080",
"username": "...",
"password": "..."
}
```
- Leave as `None` if a proxy is not required.
4. **`viewport_width` & `viewport_height`**:
- The initial window size.
- Some sites behave differently with smaller or bigger viewports.
5. **`verbose`**:
- If `True`, prints extra logs.
- Handy for debugging.
6. **`use_persistent_context`**:
- If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs.
- Typically also set `user_data_dir` to point to a folder.
7. **`cookies`** & **`headers`**:
- E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`.
8. **`user_agent`**:
- Custom User-Agent string. If `None`, a default is used.
- You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection).
9. **`text_mode`** & **`light_mode`**:
- `text_mode=True` disables images, possibly speeding up text-only crawls.
- `light_mode=True` turns off certain background features for performance.
10. **`extra_args`**:
- Additional flags for the underlying browser.
- E.g. `["--disable-extensions"]`.
11. **`enable_stealth`**:
- If `True`, enables stealth mode using playwright-stealth.
- Modifies browser fingerprints to avoid basic bot detection.
- Default is `False`. Recommended for sites with bot protection.
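For instance, a stealth-enabled launch could look like this (a sketch using the fields listed above; the URL is a placeholder):
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

stealth_browser = BrowserConfig(
    browser_type="chromium",
    headless=True,
    enable_stealth=True  # patch fingerprints via playwright-stealth
)

async with AsyncWebCrawler(config=stealth_browser) as crawler:
    result = await crawler.arun("https://example.com")
```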
### Helper Methods
Both configuration classes provide a `clone()` method to create modified copies:
```python
# Create a base browser config
base_browser = BrowserConfig(
browser_type="chromium",
headless=True,
text_mode=True
)
# Create a visible browser config for debugging
debug_browser = base_browser.clone(
headless=False,
verbose=True
)
```
A minimal end-to-end usage example:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_conf = BrowserConfig(
browser_type="firefox",
headless=False,
text_mode=True
)
async with AsyncWebCrawler(config=browser_conf) as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:300])
```
## 2. CrawlerRunConfig Essentials
```python
class CrawlerRunConfig:
def __init__(
word_count_threshold=200,
extraction_strategy=None,
markdown_generator=None,
cache_mode=None,
js_code=None,
wait_for=None,
screenshot=False,
pdf=False,
capture_mhtml=False,
# Location and Identity Parameters
locale=None, # e.g. "en-US", "fr-FR"
timezone_id=None, # e.g. "America/New_York"
geolocation=None, # GeolocationConfig object
# Resource Management
enable_rate_limiting=False,
rate_limit_config=None,
memory_threshold_percent=70.0,
check_interval=1.0,
max_session_permit=20,
display_mode=None,
verbose=True,
stream=False, # Enable streaming for arun_many()
# ... other advanced parameters omitted
):
...
```
### Key Fields to Note
1. **`word_count_threshold`**:
- The minimum word count before a block is considered.
- If your site has lots of short paragraphs or items, you can lower it.
2. **`extraction_strategy`**:
- Where you plug in JSON-based extraction (CSS, LLM, etc.).
- If `None`, no structured extraction is done (only raw/cleaned HTML + markdown).
3. **`markdown_generator`**:
- E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done.
- If `None`, a default approach is used.
4. **`cache_mode`**:
- Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.).
- If `None`, it typically defaults to `CacheMode.ENABLED`.
5. **`js_code`**:
- A string or list of JS strings to execute.
- Great for "Load More" buttons or user interactions.
6. **`wait_for`**:
- A CSS or JS expression to wait for before extracting content.
- Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`.
7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**:
- If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded.
- The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string).
8. **Location Parameters**:
- **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences
- **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`)
- **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)`
9. **`verbose`**:
- Logs additional runtime details.
- Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`.
10. **`enable_rate_limiting`**:
- If `True`, enables rate limiting for batch processing.
- Requires `rate_limit_config` to be set.
11. **`memory_threshold_percent`**:
- The memory threshold (as a percentage) to monitor.
- If exceeded, the crawler will pause or slow down.
12. **`check_interval`**:
- The interval (in seconds) to check system resources.
- Affects how often memory and CPU usage are monitored.
13. **`max_session_permit`**:
- The maximum number of concurrent crawl sessions.
- Helps prevent overwhelming the system.
14. **`url_matcher`** & **`match_mode`**:
- Enable URL-specific configurations when used with `arun_many()`.
- Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
- Use `match_mode` (OR/AND) to control how multiple patterns combine.
15. **`display_mode`**:
- The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
- Affects how much information is printed during the crawl.
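For example, the location and identity parameters might be combined like this (a sketch; the top-level `GeolocationConfig` import is an assumption):
```python
from crawl4ai import CrawlerRunConfig, GeolocationConfig  # import path assumed

run_cfg = CrawlerRunConfig(
    locale="fr-FR",              # language preference
    timezone_id="Europe/Paris",  # browser timezone
    geolocation=GeolocationConfig(latitude=48.8566, longitude=2.3522)  # Paris
)
```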
### Helper Methods
The `clone()` method is particularly useful for creating variations of your crawler configuration:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
word_count_threshold=200,
wait_until="networkidle"
)
# Create variations for different use cases
stream_config = base_config.clone(
stream=True, # Enable streaming mode
cache_mode=CacheMode.BYPASS
)
debug_config = base_config.clone(
page_timeout=120000, # Longer timeout for debugging
verbose=True
)
```
The `clone()` method:
- Creates a new instance with all the same settings
- Updates only the specified parameters
- Leaves the original configuration unchanged
- Perfect for creating variations without repeating all parameters
## 3. LLMConfig Essentials
### Key Fields to Note
1. **`provider`**:
- Which LLM provider to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
- Optional. If not provided explicitly, it is read from an environment variable based on the provider; e.g., for a Gemini model, `GEMINI_API_KEY` is read from the environment.
- Pass the provider's API token directly, e.g. `api_token = "gsk_..."` (placeholder shown).
- Or reference an environment variable with the `"env:"` prefix, e.g. `api_token = "env:GROQ_API_KEY"`.
3. **`base_url`**:
- If your provider has a custom endpoint
```python
import os
from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
## 4. Putting It All Together
In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy
async def main():
# 1) Browser config: headless, bigger viewport, no proxy
browser_conf = BrowserConfig(
headless=True,
viewport_width=1280,
viewport_height=720
)
# 2) Example extraction strategy
schema = {
"name": "Articles",
"baseSelector": "div.article",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
extraction = JsonCssExtractionStrategy(schema)
# 3) Example LLM content filtering
gemini_config = LLMConfig(
provider="gemini/gemini-1.5-pro",
api_token = "env:GEMINI_API_TOKEN"
)
# Initialize LLM filter with specific instruction
filter = LLMContentFilter(
llm_config=gemini_config, # or your preferred provider
instruction="""
Focus on extracting the core educational content.
Include:
- Key concepts and explanations
- Important code examples
- Essential technical details
Exclude:
- Navigation elements
- Sidebars
- Footer content
Format the output as clean markdown with proper code blocks and headers.
""",
chunk_token_threshold=500, # Adjust based on your needs
verbose=True
)
md_generator = DefaultMarkdownGenerator(
content_filter=filter,
options={"ignore_links": True}
)
# 4) Crawler run config: skip cache, use extraction
run_conf = CrawlerRunConfig(
markdown_generator=md_generator,
extraction_strategy=extraction,
cache_mode=CacheMode.BYPASS,
)
async with AsyncWebCrawler(config=browser_conf) as crawler:
# 5) Execute the crawl
result = await crawler.arun(url="https://example.com/news", config=run_conf)
if result.success:
print("Extracted content:", result.extracted_content)
else:
print("Error:", result.error_message)
if __name__ == "__main__":
asyncio.run(main())
```
## 5. Next Steps
- [BrowserConfig, CrawlerRunConfig & LLMConfig Reference](../api/parameters.md)
- **Custom Hooks & Auth** (Inject JavaScript or handle login forms).
- **Session Management** (Re-use pages, preserve state across multiple calls).
- **Advanced Caching** (Fine-tune read/write cache modes).
## 6. Conclusion
## 1. **BrowserConfig** – Controlling the Browser
`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_cfg = BrowserConfig(
browser_type="chromium",
headless=True,
viewport_width=1280,
viewport_height=720,
proxy="http://user:pass@proxy:8080",
user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
)
```
## 1.1 Parameter Highlights
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. |
| **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. |
| **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. |
| **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). |
| **`proxy`** | `str` (deprecated) | Deprecated. Use `proxy_config` instead. If set, it will be auto-converted internally. |
| **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. |
| **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. |
| **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. |
| **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). |
| **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. |
| **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. |
| **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. |
| **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. |
| **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. |
| **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. |
| **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. |
| **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. |
- Set `headless=False` to visually **debug** how pages load or how interactions proceed.
- If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.
- For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
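A sketch of a persistent, authenticated-session setup along those lines (the profile path is a placeholder):
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    use_persistent_context=True,       # keep cookies/local storage across runs
    user_data_dir="/path/to/profile",  # placeholder directory for the profile
    headless=False                     # visible browser helps when logging in once
)
```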
## 2. **CrawlerRunConfig** – Controlling Each Crawl
While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
run_cfg = CrawlerRunConfig(
wait_for="css:.main-content",
word_count_threshold=15,
excluded_tags=["nav", "footer"],
exclude_external_links=True,
stream=True, # Enable streaming for arun_many()
)
```
## 2.1 Parameter Highlights
### A) **Content Processing**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
| **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. |
| **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). |
| **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html'). |
| **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. |
| **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. |
| **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). |
| **`excluded_selector`** | `str` (None) | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`. |
| **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. |
| **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). |
| **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. |
| **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. |
### B) **Caching & Session**
| **Parameter** | **Type / Default** | **What It Does** |
|-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. |
| **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. |
| **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. |
| **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. |
| **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). |
| **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes). |
### C) **Page Navigation & Timing**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`. |
| **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. |
| **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. |
| **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. |
| **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. |
| **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency. |
| **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
| **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. |
### D) **Page Interaction**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
| **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. |
| **`js_only`** | `bool` (False) | If `True`, indicates we’re reusing an existing session and only applying JS. No full reload. |
| **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. |
| **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). |
| **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. |
| **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. |
| **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. |
| **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. |
| **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. |
| **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. |
| **`adjust_viewport_to_content`** | `bool` (False) | Resizes viewport to match page content height. |
If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.
### E) **Media Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
| **`screenshot`** | `bool` (False) | Capture a screenshot (base64) in `result.screenshot`. |
| **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. |
| **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used. |
| **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. |
| **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. |
| **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image’s alt text or description to be considered valid. |
| **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). |
| **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. |
### F) **Link/Domain Handling**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list can be extended. Any link to these domains is removed from final output. |
| **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. |
| **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). |
| **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). |
| **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. |
### G) **Debug & Logging**
| **Parameter** | **Type / Default** | **What It Does** |
|----------------|--------------------|---------------------------------------------------------------------------|
| **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. |
| **`log_console`** | `bool` (False) | Logs the page’s JavaScript console output if you want deeper JS debugging.|
### H) **Virtual Scroll Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`virtual_scroll_config`** | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. |
When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`:
```python
from crawl4ai import VirtualScrollConfig
virtual_config = VirtualScrollConfig(
container_selector="#timeline", # CSS selector for scrollable container
scroll_count=30, # Number of times to scroll
scroll_by="container_height", # How much to scroll: "container_height", "page_height", or pixels (e.g. 500)
wait_after_scroll=0.5 # Seconds to wait after each scroll for content to load
)
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config
)
```
**VirtualScrollConfig Parameters:**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|---------------------------|-------------------------------------------------------------------------------------------|
| **`container_selector`** | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`) |
| **`scroll_count`** | `int` (10) | Maximum number of scrolls to perform |
| **`scroll_by`** | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`) |
| **`wait_after_scroll`** | `float` (0.5) | Time in seconds to wait after each scroll for new content to load |
- Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram)
- Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll)
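For the appended-content case, a sketch might look like this (values illustrative, parameters per the Page Interaction table above):
```python
from crawl4ai import CrawlerRunConfig

infinite_scroll_cfg = CrawlerRunConfig(
    scan_full_page=True,  # auto-scroll to trigger traditional infinite scroll
    scroll_delay=0.3      # pause between scroll steps (seconds)
)
```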
### I) **URL Matching Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`url_matcher`** | `UrlMatcher` (None) | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types. **None means match ALL URLs** |
| **`match_mode`** | `MatchMode` (MatchMode.OR) | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match) |
The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Simple string pattern (glob-style)
pdf_config = CrawlerRunConfig(
url_matcher="*.pdf",
scraping_strategy=PDFContentScrapingStrategy()
)
# Multiple patterns with OR logic (default)
blog_config = CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
match_mode=MatchMode.OR # Any pattern matches
)
# Function matcher
api_config = CrawlerRunConfig(
url_matcher=lambda url: 'api' in url or url.endswith('.json'),
# Other settings like extraction_strategy
)
# Mixed: String + Function with AND logic
complex_config = CrawlerRunConfig(
url_matcher=[
lambda url: url.startswith('https://'), # Must be HTTPS
"*.org/*", # Must be .org domain
lambda url: 'docs' in url # Must contain 'docs'
],
match_mode=MatchMode.AND # ALL conditions must match
)
# Combined patterns and functions with AND logic
secure_docs = CrawlerRunConfig(
url_matcher=["https://*", lambda url: '.doc' in url],
match_mode=MatchMode.AND # Must be HTTPS AND contain .doc
)
# Default config - matches ALL URLs
default_config = CrawlerRunConfig() # No url_matcher = matches everything
```
**UrlMatcher Types:**
- **None (default)**: When `url_matcher` is None or not set, the config matches ALL URLs
- **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`
- **Functions**: `lambda url: bool` - Custom logic for complex matching
- **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`
**Important Behavior:**
- When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found"
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
```python
# Create a base configuration
base_config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED,
word_count_threshold=200
)
# Create variations using clone()
stream_config = base_config.clone(stream=True)
no_cache_config = base_config.clone(
cache_mode=CacheMode.BYPASS,
stream=True
)
```
The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.
## 2.3 Example Usage
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )

    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        stream=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
## 2.4 Compliance & Ethics
| **Parameter** | **Type / Default** | **What It Does** |
|-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
| **`check_robots_txt`**| `bool` (False) | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend. |
| **`user_agent`** | `str` (None) | User agent string to identify your crawler. Used for robots.txt checking when enabled. |
```python
run_config = CrawlerRunConfig(
    check_robots_txt=True,  # Enable robots.txt compliance
    user_agent="MyBot/1.0"  # Identify your crawler
)
```
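When a URL is disallowed by robots.txt, the crawl fails rather than fetching the page. A minimal sketch of handling that outcome, continuing inside an `async with AsyncWebCrawler(...)` block (the URL is illustrative; the denial surfaces through `result.success`/`result.error_message`, following the error-handling pattern shown later in this document):
```python
result = await crawler.arun("https://example.com/private", config=run_config)
if not result.success:
    # A robots.txt denial shows up as a failed crawl
    print(f"Skipped or failed: {result.error_message}")
```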
## 3. **LLMConfig** - Setting up LLM providers
A single `LLMConfig` holds provider details (provider name, API token, optional endpoint) and can be reused across the components that call an LLM:
1. `LLMExtractionStrategy`
2. `LLMContentFilter`
3. `JsonCssExtractionStrategy.generate_schema`
4. `JsonXPathExtractionStrategy.generate_schema`
## 3.1 Parameters
| **Parameter** | **Type / Default** | **What It Does** |
|---------------|--------------------|------------------|
| **`provider`** | `str` (default: `"openai/gpt-4o-mini"`) | Which LLM provider and model to use, in `"<provider>/<model>"` form. Supported values include `"ollama/llama3"`, `"groq/llama3-70b-8192"`, `"groq/llama3-8b-8192"`, `"openai/gpt-4o-mini"`, `"openai/gpt-4o"`, `"openai/o1-mini"`, `"openai/o1-preview"`, `"openai/o3-mini"`, `"openai/o3-mini-high"`, `"anthropic/claude-3-haiku-20240307"`, `"anthropic/claude-3-opus-20240229"`, `"anthropic/claude-3-sonnet-20240229"`, `"anthropic/claude-3-5-sonnet-20240620"`, `"gemini/gemini-pro"`, `"gemini/gemini-1.5-pro"`, `"gemini/gemini-2.0-flash"`, `"gemini/gemini-2.0-flash-exp"`, `"gemini/gemini-2.0-flash-lite-preview-02-05"`, and `"deepseek/deepseek-chat"` |
| **`api_token`** | `str` (None) | API token for the provider. Optional: when omitted, it is read from the provider's default environment variable (e.g., `GEMINI_API_KEY` when a Gemini model is the provider). It can also be passed directly (e.g., `api_token="gsk_..."`) or resolved from a named environment variable using the `env:` prefix (e.g., `api_token="env:GROQ_API_KEY"`) |
| **`base_url`** | `str` (None) | Optional custom API endpoint, if your provider is served from a non-default URL |
## 3.2 Example Usage
```python
import os
from crawl4ai import LLMConfig

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))
```
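Because the same config can be shared across tasks, one object can drive both schema generation and LLM extraction. A sketch of that reuse (the `llm_config=` keyword on `generate_schema` and `LLMExtractionStrategy` follows this document's own list of supported components, but exact signatures should be checked against your installed version; `sample_html` and the query are illustrative):
```python
import os
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy

llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"))

# Illustrative HTML snippet of the page structure to model
sample_html = "<div class='product'><h2>Name</h2><span class='price'>$9.99</span></div>"

# One-time schema generation (the LLM call happens here, not on every crawl)
schema = JsonCssExtractionStrategy.generate_schema(
    sample_html,
    query="extract product name and price",
    llm_config=llm_config
)

# The same config also drives direct LLM extraction
llm_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    instruction="Extract product name and price as JSON"
)
```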
## 4. Putting It All Together
- **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.
- **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.
- **Pass** the `BrowserConfig` to `AsyncWebCrawler` and the `CrawlerRunConfig` to `arun()`.
- **Use** `LLMConfig` for LLM provider settings shared across extraction, filtering, and schema-generation tasks: `LLMExtractionStrategy`, `LLMContentFilter`, `JsonCssExtractionStrategy.generate_schema`, and `JsonXPathExtractionStrategy.generate_schema`.
```python
# Create a modified copy with the clone() method
stream_cfg = run_cfg.clone(
    stream=True,
    cache_mode=CacheMode.BYPASS
)
```
## Crawling Patterns
## Simple Crawling
## Basic Usage
Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig()   # Default crawl run configuration

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)  # Print clean markdown content

if __name__ == "__main__":
    asyncio.run(main())
```
## Understanding the Response
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.6),
        options={"ignore_links": True}
    )
)
result = await crawler.arun(
    url="https://example.com",
    config=config
)

# Different content formats
print(result.html)                   # Raw HTML
print(result.cleaned_html)           # Cleaned HTML
print(result.markdown.raw_markdown)  # Raw markdown from cleaned html
print(result.markdown.fit_markdown)  # Most relevant content in markdown

# Check success status
print(result.success)      # True if crawl succeeded
print(result.status_code)  # HTTP status code (e.g., 200, 404)

# Access extracted media and links
print(result.media)  # Dictionary of found media (images, videos, audio)
print(result.links)  # Dictionary of internal and external links
```
## Adding Basic Options
Customize your crawl using `CrawlerRunConfig`:
```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,       # Minimum words per content block
    exclude_external_links=True,   # Remove external links
    remove_overlay_elements=True,  # Remove popups/modals
    process_iframes=True           # Process iframe content
)
result = await crawler.arun(
    url="https://example.com",
    config=run_config
)
```
## Handling Errors
```python
run_config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
```
## Logging and Debugging
Enable verbose logging in `BrowserConfig`:
```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    run_config = CrawlerRunConfig()
    result = await crawler.arun(url="https://example.com", config=run_config)
```
## Complete Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED  # Use cache if available
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        if result.success:
            # Print clean content
            print("Content:", result.markdown[:500])  # First 500 chars

            # Process images
            for image in result.media["images"]:
                print(f"Found image: {image['src']}")

            # Process links
            for link in result.links["internal"]:
                print(f"Internal link: {link['href']}")
        else:
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```
## Content Processing
## Markdown Generation Basics
This section covers:
1. How to configure the **Default Markdown Generator**
2. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)
> **Prerequisite:** You know how to configure `CrawlerRunConfig`.
## 1. Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success:
            print("Raw Markdown Output:\n")
            print(result.markdown)  # The unfiltered markdown from the page
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
- `CrawlerRunConfig( markdown_generator = DefaultMarkdownGenerator() )` instructs Crawl4AI to convert the final HTML into markdown
... (truncated)
```