
web-scraping

This skill activates for web scraping and Actor development. It proactively discovers sitemaps/APIs, recommends optimal strategy (sitemap/API/Playwright/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 22
Hot score: 88
Updated: March 20, 2026
Overall rating: C (3.7)
Composite score: 3.7
Best-practice grade: C (62.3)

Install command

npx @skill-hub/cli install yfe404-web-scraping

Repository

yfe404/yfe404-web-scraping

Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack, Backend, Testing.

Target audience: everyone.

License: MIT.

Original source

Catalog source: SkillHub Club.

Repository owner: yfe404.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install web-scraping into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://www.skillhub.club/skills/yfe404-web-scraping before adding web-scraping to shared team environments
  • Use web-scraping for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: web-scraping
description: This skill activates for web scraping and Actor development. It proactively discovers sitemaps/APIs, recommends optimal strategy (sitemap/API/Playwright/hybrid), and implements iteratively. For production, it guides TypeScript Actor creation via Apify CLI.
license: MIT
---

# Web Scraping with Intelligent Strategy Selection

## When This Skill Activates

Activate automatically when user requests:
- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads `strategies/anti-blocking.md`)
- "Make this an Apify Actor" (loads `apify/` subdirectory)
- "Productionize this scraper"

## Proactive Workflow

This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.

### Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)

When user says "scrape X", **immediately start with hands-on reconnaissance** using MCP tools:

**DO NOT jump to automated checks or implementation** - reconnaissance prevents wasted effort and discovers hidden APIs.

#### Use Playwright MCP & Chrome DevTools MCP:

**1. Open site in real browser** (Playwright MCP)
   - Navigate like a real user
   - Observe page loading behavior (SSR? SPA? Loading states?)
   - Take screenshots for reference
   - Test basic interactions

**2. Monitor network traffic** (Chrome DevTools via Playwright)
   - Watch XHR/Fetch requests in real-time
   - **Find API endpoints** returning JSON (10-100x faster than HTML scraping!)
   - Analyze request/response patterns
   - Document headers, cookies, authentication tokens
   - Extract pagination parameters

**3. Test site interactions**
   - **Pagination**: URL-based? API? Infinite scroll?
   - **Filtering and search**: How do they work?
   - **Dynamic content loading**: Triggers and patterns
   - **Authentication flows**: Required? Optional?

**4. Assess protection mechanisms**
   - Cloudflare/bot detection
   - CAPTCHA requirements
   - Rate limiting behavior (test with multiple requests)
   - Fingerprinting scripts

**5. Generate Intelligence Report**
   - Site architecture (framework, rendering method)
   - **Discovered APIs/endpoints** with full specs
   - Protection mechanisms and required countermeasures
   - **Optimal extraction strategy** (API > Sitemap > HTML)
   - Time/complexity estimates

**See**: `workflows/reconnaissance.md` for complete reconnaissance guide with MCP examples

**Why this matters**: Reconnaissance discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. **Never skip this step.**

### Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)

After Phase 1 reconnaissance, **validate findings with automated checks**:

#### 1. Check for Sitemaps

```bash
# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml
```

**Log findings clearly**:
- ✓ "Found sitemap at /sitemap.xml with ~1,234 URLs"
- ✓ "Found sitemap index with 5 sub-sitemaps"
- ✗ "No sitemap detected at common locations"

**Why this matters**: Sitemaps provide instant URL discovery (roughly 60x faster than crawling).
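Once the curl checks confirm a sitemap exists, pulling the `<loc>` entries out takes only a few lines. A minimal sketch — the XML below is sample data, and in real projects Crawlee's `RobotsFile.find()` + `parseUrlsFromSitemaps()` handles discovery and parsing for you:

```typescript
// Sample sitemap content; real code would fetch https://[site]/sitemap.xml first.
const sitemapXml = `
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/1</loc></url>
  <url><loc>https://example.com/products/2</loc></url>
</urlset>`;

function parseSitemapUrls(xml: string): string[] {
    // A regex is enough for well-formed sitemaps; use an XML parser for anything exotic.
    const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
    const urls: string[] = [];
    let m: RegExpExecArray | null;
    while ((m = re.exec(xml)) !== null) urls.push(m[1]);
    return urls;
}

console.log(parseSitemapUrls(sitemapXml)); // logs both product URLs
```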

#### 2. Investigate APIs

**Prompt user**:
```
Should I check for JSON APIs first? (Highly recommended)

Benefits of APIs vs HTML scraping:
• 10-100x faster execution
• More reliable (structured JSON vs fragile HTML)
• Less bandwidth usage
• Easier to maintain

Check for APIs? [Y/n]
```

**If yes**, guide user:
1. Open browser DevTools → Network tab
2. Navigate the target website
3. Look for XHR/Fetch requests
4. Check for endpoints: `/api/`, `/v1/`, `/v2/`, `/graphql`, `/_next/data/`
5. Analyze request/response format (JSON, GraphQL, REST)

**Log findings**:
- ✓ "Found API: GET /api/products/{id} (returns JSON)"
- ✓ "Found GraphQL endpoint: /graphql"
- ✗ "No obvious public APIs detected"

#### 3. Analyze Site Structure

**Automatically assess**:
- JavaScript-heavy? (Look for React, Vue, Angular indicators)
- Authentication required? (Login walls, auth tokens)
- Page count estimate (from sitemap or site exploration)
- Rate limiting indicators (robots.txt directives)
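Part of this assessment can be automated by scanning fetched HTML for framework markers. A rough heuristic sketch — the marker strings are common conventions, not guarantees:

```typescript
// Guess the rendering stack from well-known markers in the page source.
function detectFramework(html: string): string[] {
    const markers: [string, string][] = [
        ['__NEXT_DATA__', 'Next.js'],
        ['data-reactroot', 'React'],
        ['ng-version', 'Angular'],
        ['data-v-app', 'Vue 3'],
        ['wp-content', 'WordPress'],
    ];
    return markers
        .filter(([needle]) => html.includes(needle))
        .map(([, name]) => name);
}

const sampleHtml =
    '<html><body><div id="__next"><script id="__NEXT_DATA__">{}</script></div></body></html>';
console.log(detectFramework(sampleHtml)); // → [ 'Next.js' ]
```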

### Phase 3: STRATEGY RECOMMENDATION

Based on Phases 1-2 findings, present 2-3 options with clear reasoning:

#### Example Output Template:

```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Phase 1 Intelligence (Reconnaissance):
✓ API discovered via DevTools: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✗ No authentication required

Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⭐ Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
   ✓ Use sitemap to get all 1,234 product URLs instantly
   ✓ Extract product IDs from URLs
   ✓ Fetch data via API (fast, reliable JSON)

   Estimated time: 8-12 minutes
   Complexity: Low-Medium
   Data quality: Excellent
   Speed: Very Fast

⚡ Option 2: Sitemap + Playwright
   ✓ Use sitemap for URLs
   ✓ Scrape HTML with Playwright

   Estimated time: 15-20 minutes
   Complexity: Medium
   Data quality: Good
   Speed: Fast

🔧 Option 3: Pure API (if sitemap fails)
   ✓ Discover product IDs through API exploration
   ✓ Fetch all data via API

   Estimated time: 10-15 minutes
   Complexity: Medium
   Data quality: Excellent
   Speed: Fast

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds

Proceed with Option 1? [Y/n]
```

**Key principles**:
- Always recommend the SIMPLEST approach that works
- Sitemap > API > Playwright (in terms of simplicity)
- Show time estimates and complexity
- Explain reasoning clearly
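These principles can be sketched as a small decision function. The labels and ordering below are illustrative only, not part of the skill's API:

```typescript
// Illustrative strategy selector based on reconnaissance findings.
interface Recon {
    hasSitemap: boolean;
    hasJsonApi: boolean;
    needsJsRendering: boolean;
}

function recommendStrategy(r: Recon): string {
    // Prefer the simplest approach that works: sitemap + API beats everything.
    if (r.hasSitemap && r.hasJsonApi) return 'hybrid (sitemap + API)';
    if (r.hasJsonApi) return 'pure API';
    if (r.hasSitemap) return r.needsJsRendering ? 'sitemap + Playwright' : 'sitemap + HTTP (Cheerio)';
    return r.needsJsRendering ? 'Playwright crawl' : 'HTTP crawl (Cheerio)';
}

console.log(recommendStrategy({ hasSitemap: true, hasJsonApi: true, needsJsRendering: false }));
// → hybrid (sitemap + API)
```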

### Phase 4: ITERATIVE IMPLEMENTATION

Implement scraper incrementally, starting simple and adding complexity only as needed.

**Core Pattern**:
1. Implement recommended approach (minimal code)
2. Test with small batch (5-10 items)
3. Validate data quality
4. Scale to full dataset or fallback
5. Handle blocking if encountered
6. Add robustness (error handling, retries, logging)
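The core pattern above can be sketched as a test-then-scale wrapper. Everything here is illustrative: `scrapeOne` stands in for your real request handler, and the 90% fill-rate gate is an assumed threshold:

```typescript
type Item = { title: string | null };

// Scrape a list of URLs sequentially with a caller-supplied handler.
async function scrapeBatch(
    urls: string[],
    scrapeOne: (url: string) => Promise<Item>,
): Promise<Item[]> {
    const out: Item[] = [];
    for (const url of urls) out.push(await scrapeOne(url));
    return out;
}

// Quality gate: scale up only if the sample batch looks healthy.
function qualityOk(items: Item[], minFillRate = 0.9): boolean {
    const filled = items.filter((i) => i.title !== null).length;
    return items.length > 0 && filled / items.length >= minFillRate;
}

async function run(allUrls: string[], scrapeOne: (url: string) => Promise<Item>) {
    // Steps 1-3: small batch first, then validate data quality.
    const sample = await scrapeBatch(allUrls.slice(0, 5), scrapeOne);
    if (!qualityOk(sample)) throw new Error('Sample failed validation; fix selectors or fall back');
    // Step 4: scale to the full dataset.
    return scrapeBatch(allUrls, scrapeOne);
}
```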

**See**: `workflows/implementation.md` for complete implementation patterns and code examples

### Phase 5: PRODUCTIONIZATION (On Request)

Convert scraper to production-ready Apify Actor.

**Activation triggers**:
- "Make this an Apify Actor"
- "Productionize this scraper"
- "Deploy to Apify"
- "Create an actor from this"

**Core Pattern**:
1. Confirm TypeScript preference (STRONGLY RECOMMENDED)
2. Initialize with `apify create` command (CRITICAL)
3. Port scraping logic to Actor format
4. Test locally and deploy

**See**: `workflows/productionization.md` for complete productionization workflow and `apify/` directory for all Actor development guides

## Quick Reference

| Task | Pattern/Command | Documentation |
|------|----------------|---------------|
| **Reconnaissance** | **Playwright + DevTools MCP** | **`workflows/reconnaissance.md`** |
| Find sitemaps | `RobotsFile.find(url)` | `strategies/sitemap-discovery.md` |
| Filter sitemap URLs | `RequestList + regex` | `reference/regex-patterns.md` |
| Discover APIs | DevTools → Network tab | `strategies/api-discovery.md` |
| Playwright scraping | `PlaywrightCrawler` | `strategies/playwright-scraping.md` |
| HTTP scraping | `CheerioCrawler` | `strategies/cheerio-scraping.md` |
| Hybrid approach | Sitemap + API | `strategies/hybrid-approaches.md` |
| Handle blocking | fingerprint-suite + proxies | `strategies/anti-blocking.md` |
| Fingerprint configs | Quick patterns | `reference/fingerprint-patterns.md` |
| Create Apify Actor | `apify create` | `apify/cli-workflow.md` |
| Template selection | Cheerio vs Playwright | `workflows/productionization.md` |
| Input schema | `.actor/input_schema.json` | `apify/input-schemas.md` |
| Deploy actor | `apify push` | `apify/deployment.md` |

## Common Patterns

### Pattern 1: Sitemap-Based Scraping

```javascript
import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';

// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        const data = await page.evaluate(() => ({
            title: document.title,
            // ... extract data
        }));
        await Dataset.pushData(data);
    },
});

await crawler.addRequests(urls);
await crawler.run();
```

See `examples/sitemap-basic.js` for complete example.

### Pattern 2: API-Based Scraping

```javascript
import { gotScraping } from 'got-scraping';

const productIds = [123, 456, 789];

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);
}
```

See `examples/api-scraper.js` for complete example.

### Pattern 3: Hybrid (Sitemap + API)

```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

// Fetch data via API
for (const id of productIds) {
    const { body } = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });
    // Process body (the parsed JSON payload)
}
```

See `examples/hybrid-sitemap-api.js` for complete example.

## Directory Navigation

This skill uses **progressive disclosure** - detailed information is organized in subdirectories and loaded only when needed.

### Workflows (Implementation Patterns)
**For**: Step-by-step workflow guides for each phase

- `workflows/reconnaissance.md` - **Phase 1 interactive reconnaissance (CRITICAL)**
- `workflows/implementation.md` - Phase 4 iterative implementation patterns
- `workflows/productionization.md` - Phase 5 Apify Actor creation workflow

### Strategies (Deep Dives)
**For**: Detailed guides on specific scraping approaches

- `strategies/sitemap-discovery.md` - Complete sitemap guide (4 patterns)
- `strategies/api-discovery.md` - Finding and using APIs
- `strategies/playwright-scraping.md` - Browser-based scraping
- `strategies/cheerio-scraping.md` - HTTP-only scraping
- `strategies/hybrid-approaches.md` - Combining strategies
- `strategies/anti-blocking.md` - Fingerprinting & proxies for blocked sites

### Examples (Runnable Code)
**For**: Working code to reference or execute

**JavaScript Learning Examples** (Simple standalone scripts):
- `examples/sitemap-basic.js` - Simple sitemap scraper
- `examples/api-scraper.js` - Pure API approach
- `examples/playwright-basic.js` - Basic Playwright scraper
- `examples/hybrid-sitemap-api.js` - Combined approach
- `examples/iterative-fallback.js` - Try sitemap→API→Playwright

**TypeScript Production Examples** (Complete Actors):
- `apify/examples/basic-scraper/` - Sitemap + Playwright
- `apify/examples/anti-blocking/` - Fingerprinting + proxies
- `apify/examples/hybrid-api/` - Sitemap + API (optimal)

### Reference (Quick Lookup)
**For**: Quick patterns and troubleshooting

- `reference/regex-patterns.md` - Common URL regex patterns
- `reference/selector-guide.md` - Playwright selector strategies
- `reference/fingerprint-patterns.md` - Common fingerprint configurations
- `reference/anti-patterns.md` - What NOT to do

### Apify (Production Deployment)
**For**: Creating production Apify Actors

- `apify/README.md` - When and how to use Apify
- `apify/typescript-first.md` - **Why TypeScript for actors**
- `apify/cli-workflow.md` - **apify create workflow (CRITICAL)**
- `apify/initialization.md` - Complete setup guide
- `apify/input-schemas.md` - Input validation patterns
- `apify/configuration.md` - actor.json setup
- `apify/deployment.md` - Testing and deployment
- `apify/templates/` - TypeScript boilerplate

**Note**: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.

## Core Principles

### 1. Progressive Enhancement
Start with the simplest approach that works:
- Sitemap > API > Playwright
- Static > Dynamic
- HTTP > Browser

### 2. Proactive Discovery
Always investigate before implementing:
- Check for sitemaps automatically
- Look for APIs (ask user to check DevTools)
- Analyze site structure

### 3. Iterative Implementation
Build incrementally:
- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last

### 4. Production-Ready Code
When productionizing:
- Use TypeScript (strongly recommended)
- Use `apify create` (never manual setup)
- Add proper error handling
- Include logging and monitoring

---

**Remember**: Sitemaps first, APIs second, scraping last!

For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### strategies/anti-blocking.md

```markdown
# Anti-Blocking & Fingerprinting

## Overview

When websites detect automated scraping, they deploy anti-bot measures. This guide covers Apify's fingerprint-suite and techniques to bypass blocking while respecting ethical scraping practices.

## When You Need Anti-Blocking

### Signs You're Being Blocked

- **403 Forbidden** errors
- **Cloudflare challenges** ("Checking your browser...")
- **Bot detection** messages ("Access Denied", "Unusual traffic")
- **CAPTCHAs** appearing unexpectedly
- **Rate limiting** (429 Too Many Requests)
- **Empty responses** or incomplete data
- **Timeouts** or connection resets
- Pages return different content than browser shows
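Several of these signals can be checked programmatically before deciding whether to escalate. A heuristic sketch — the marker strings are common block-page phrases, not an exhaustive list:

```typescript
// Classify a response as a likely block based on status code and body text.
function looksBlocked(status: number, bodyText: string): boolean {
    if (status === 403 || status === 429) return true;
    const text = bodyText.toLowerCase();
    return ['checking your browser', 'access denied', 'unusual traffic', 'captcha']
        .some((marker) => text.includes(marker));
}

console.log(looksBlocked(200, 'Checking your browser before accessing...')); // → true
```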

### Escalation Strategy

Try solutions in this order:

1. **Slow down** → Reduce request rate
2. **Add headers** → Use realistic User-Agent, headers
3. **Fingerprinting** → Use fingerprint-suite
4. **Proxies** → Datacenter → Residential
5. **Session rotation** → Rotate browser sessions
6. **Advanced** → CAPTCHA solving (ethical considerations)
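Step 1 ("slow down") often amounts to retrying with exponential backoff plus jitter. A minimal sketch, where `fetchFn` is a placeholder for your HTTP call and the delay constants are illustrative:

```typescript
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms));

// "Equal jitter" backoff: half fixed, half random, capped at capMs.
function backoffDelay(attempt: number, baseMs = 1000, capMs = 30000): number {
    const exp = Math.min(capMs, baseMs * Math.pow(2, attempt));
    return exp / 2 + Math.random() * (exp / 2);
}

async function fetchWithBackoff<T>(
    fetchFn: () => Promise<T>,
    maxRetries = 4,
    baseMs = 1000,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await fetchFn();
        } catch (err) {
            if (attempt >= maxRetries) throw err; // give up after maxRetries retries
            await sleep(backoffDelay(attempt, baseMs));
        }
    }
}
```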

## The fingerprint-suite

Apify's **fingerprint-suite** generates and injects realistic browser fingerprints to make your scraper appear as a real browser.

### What is Browser Fingerprinting?

Websites collect browser characteristics:
- User-Agent header
- Screen resolution
- Installed fonts
- Canvas fingerprint
- WebGL renderer
- Timezone & language
- Installed plugins
- Hardware concurrency

Bots typically have **inconsistent fingerprints**. fingerprint-suite generates **consistent, realistic fingerprints** that match real browsers.

### Components

| Package | Purpose |
|---------|---------|
| **header-generator** | Generates realistic HTTP headers |
| **fingerprint-generator** | Generates full browser fingerprints (headers + JS APIs) |
| **fingerprint-injector** | Injects fingerprints into Playwright/Puppeteer |
| **generative-bayesian-network** | ML model for realistic fingerprint generation |

## Quick Setup

### Method 1: Crawlee with FingerprintOptions (Easiest)

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.main(async () => {
    const crawler = new PlaywrightCrawler({
        // Enable automatic fingerprinting
        useSessionPool: true,
        sessionPoolOptions: {
            maxPoolSize: 50,
        },

        // Configure fingerprint generation
        fingerprintOptions: {
            devices: ['desktop'],              // desktop, mobile, or both
            operatingSystems: ['windows'],      // windows, macos, linux, ios, android
            browsers: ['chrome'],               // chrome, firefox, safari, edge
        },

        // Add proxies (highly recommended with fingerprinting)
        proxyConfiguration: await Actor.createProxyConfiguration({
            groups: ['RESIDENTIAL'],
        }),

        async requestHandler({ page, request, log }) {
            log.info(`Scraping: ${request.url}`);
            // Your scraping logic
        },
    });

    await crawler.run(['https://example.com']);
});
```

**Benefits**:
- Automatic fingerprint generation per session
- Session management built-in
- Works with all Crawlee crawlers

### Method 2: Playwright with fingerprint-injector

```typescript
import { chromium } from 'playwright';
import { newInjectedContext } from 'fingerprint-injector';

const browser = await chromium.launch({ headless: true });

// Create context with injected fingerprint
const context = await newInjectedContext(browser, {
    fingerprintOptions: {
        devices: ['desktop'],
        operatingSystems: ['windows', 'macos'],
        browsers: ['chrome'],
    },
    newContextOptions: {
        // Playwright context options
        locale: 'en-US',
        timezoneId: 'America/New_York',
    },
});

const page = await context.newPage();
await page.goto('https://example.com');
// Scrape with realistic fingerprint
```

### Method 3: Puppeteer with fingerprint-injector

```typescript
import puppeteer from 'puppeteer';
import { newInjectedPage } from 'fingerprint-injector';

const browser = await puppeteer.launch({ headless: true });

const page = await newInjectedPage(browser, {
    fingerprintOptions: {
        devices: ['mobile'],
        operatingSystems: ['android'],
    },
});

await page.goto('https://example.com');
// Scrape with mobile fingerprint
```

## Fingerprint Configuration

### Device Types

```typescript
// Desktop browsers
fingerprintOptions: {
    devices: ['desktop'],
}

// Mobile devices
fingerprintOptions: {
    devices: ['mobile'],
}

// Both (random selection)
fingerprintOptions: {
    devices: ['desktop', 'mobile'],
}
```

### Operating Systems

```typescript
// Windows + Mac (desktop)
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos'],
}

// iOS (mobile)
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['ios'],
}

// Android (mobile)
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['android'],
}
```

### Browsers

```typescript
// Chrome only (most common)
fingerprintOptions: {
    browsers: ['chrome'],
}

// Chrome + Firefox
fingerprintOptions: {
    browsers: ['chrome', 'firefox'],
}

// All browsers (random selection)
fingerprintOptions: {
    browsers: ['chrome', 'firefox', 'safari', 'edge'],
}
```

See `../reference/fingerprint-patterns.md` for more configurations.

## Proxy Configuration

**Fingerprinting alone is often not enough** - combine with proxies for best results.

### Datacenter Proxies (Faster, Cheaper)

```typescript
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['SHADER'],  // Apify datacenter proxies
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    fingerprintOptions: { devices: ['desktop'] },
    // ...
});
```

### Residential Proxies (More Reliable)

```typescript
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    fingerprintOptions: { devices: ['desktop'] },
    // ...
});
```

### Custom Proxies

```typescript
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:[email protected]:8000',
        'http://user:[email protected]:8000',
    ],
});
```

## Session Management

Sessions group requests that should appear to come from the same "user".

```typescript
const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 50,                    // Max concurrent sessions
        sessionOptions: {
            maxUsageCount: 50,              // Requests per session before rotation
            maxErrorScore: 3,                // Retire session after 3 errors
        },
    },

    async requestHandler({ session, request }) {
        // All requests in this session share:
        // - Same fingerprint
        // - Same proxy IP
        // - Same cookies
        console.log(`Session ID: ${session.id}`);
    },
});
```

## Complete Example: Anti-Blocking Scraper

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler, Dataset } from 'crawlee';

await Actor.main(async () => {
    const crawler = new PlaywrightCrawler({
        // Slow down to avoid rate limiting
        maxConcurrency: 3,
        maxRequestsPerMinute: 30,

        // Enable sessions with fingerprinting
        useSessionPool: true,
        sessionPoolOptions: {
            maxPoolSize: 20,
            sessionOptions: {
                maxUsageCount: 30,
            },
        },

        // Generate realistic fingerprints
        fingerprintOptions: {
            devices: ['desktop'],
            operatingSystems: ['windows', 'macos'],
            browsers: ['chrome'],
        },

        // Use residential proxies
        proxyConfiguration: await Actor.createProxyConfiguration({
            groups: ['RESIDENTIAL'],
        }),

        // Additional stealth
        preNavigationHooks: [
            async ({ page }) => {
                // Block unnecessary resources
                await page.route('**/*', (route) => {
                    const resourceType = route.request().resourceType();
                    if (['image', 'font', 'media'].includes(resourceType)) {
                        route.abort();
                    } else {
                        route.continue();
                    }
                });
            },
        ],

        async requestHandler({ page, request, session, log }) {
            log.info(`Scraping: ${request.url} (Session: ${session.id})`);

            try {
                // Wait for content
                await page.waitForSelector('body', { timeout: 10000 });

                // Check for blocking
                const isBlocked = await page.evaluate(() => {
                    const text = document.body.textContent?.toLowerCase() || '';
                    return text.includes('access denied') ||
                           text.includes('cloudflare') ||
                           text.includes('captcha');
                });

                if (isBlocked) {
                    log.warning('Detected blocking, retiring session');
                    session.retire();
                    throw new Error('Blocked');
                }

                // Extract data
                const data = await page.evaluate(() => ({
                    title: document.querySelector('h1')?.textContent,
                    // ...
                }));

                await Dataset.pushData(data);

                // Mark session as working
                session.markGood();

            } catch (error) {
                log.error(`Error: ${error.message}`);
                session.markBad();
                throw error; // Retry
            }
        },

        failedRequestHandler({ request, session }, { log }) {
            log.error(`Request failed after retries: ${request.url}`);
            session?.retire();
        },
    });

    await crawler.run(['https://example.com']);
});
```

## Troubleshooting

### Still Getting Blocked?

Try escalating:

1. **Slow down more**
   ```typescript
   maxConcurrency: 1,
   maxRequestsPerMinute: 10,
   ```

2. **Change fingerprint constraints**
   ```typescript
   fingerprintOptions: {
       devices: ['mobile'],  // Try mobile instead of desktop
       operatingSystems: ['ios'],
   }
   ```

3. **Upgrade proxies**
   ```typescript
   // From datacenter → residential
   groups: ['RESIDENTIAL']
   ```

4. **Rotate sessions more**
   ```typescript
   sessionOptions: {
       maxUsageCount: 10,  // Rotate after just 10 requests
   }
   ```

5. **Add delays**
   ```typescript
   import { setTimeout } from 'timers/promises';

   async requestHandler({ page }) {
       // Random delay between actions
       await setTimeout(Math.random() * 2000 + 1000); // 1-3 seconds
   }
   ```

### Detecting Fingerprint Issues

Test your fingerprint:

```typescript
const page = await context.newPage();
await page.goto('https://browserleaks.com/canvas');
// Check if fingerprint looks realistic

await page.goto('https://bot.sannysoft.com/');
// Check bot detection tests
```

### Common Issues

**Issue**: "Fingerprint doesn't match browser"

**Solution**: Ensure fingerprintOptions match actual browser:
```typescript
// If using chromium
fingerprintOptions: {
    browsers: ['chrome'],  // Not firefox!
}
```

**Issue**: "Proxy connection failed"

**Solution**: Verify proxy configuration:
```typescript
const proxyUrl = await proxyConfiguration.newUrl();
console.log('Testing proxy:', proxyUrl);
```

**Issue**: "Session pool exhausted"

**Solution**: Increase pool size:
```typescript
sessionPoolOptions: {
    maxPoolSize: 100,  // Increase from 50
}
```

## Best Practices

### ✅ DO:

- **Use fingerprinting from the start** on strict sites
- **Combine with proxies** (fingerprints alone often insufficient)
- **Match fingerprint to browser** (Chrome fingerprint with Chromium browser)
- **Rotate sessions** after errors or blocks
- **Monitor session health** (markGood/markBad)
- **Test fingerprints** on bot detection sites first
- **Respect robots.txt** even with anti-blocking
- **Use residential proxies** for strict sites

### ❌ DON'T:

- **Rely on fingerprints alone** - always use proxies too
- **Use mismatched configs** (iOS fingerprint with desktop browser)
- **Ignore session errors** - retire bad sessions
- **Scrape too fast** even with fingerprints
- **Use free proxies** - they're usually detected
- **Bypass CAPTCHAs** without permission
- **Violate ToS** - anti-blocking ≠ permission to scrape

## Ethical Considerations

**Anti-blocking is a tool, not permission**:

- ✅ Use for public data collection
- ✅ Respect rate limits (even if you can bypass them)
- ✅ Honor robots.txt
- ✅ Don't overload servers
- ❌ Don't bypass paywalls
- ❌ Don't scrape private data
- ❌ Don't violate terms of service

## Performance Impact

| Technique | Speed Impact | Detection Evasion |
|-----------|--------------|-------------------|
| No anti-blocking | Fastest | ❌ High detection |
| Headers only | Fast | ⚠️ Medium detection |
| Fingerprinting | Medium | ✅ Low detection |
| Fingerprinting + Proxies | Medium-Slow | ✅✅ Very low detection |
| + Session rotation | Slower | ✅✅✅ Minimal detection |

**Trade-off**: More anti-blocking = Slower but more reliable

## Resources

- [fingerprint-suite GitHub](https://github.com/apify/fingerprint-suite)
- [Apify Anti-Scraping Academy](https://docs.apify.com/academy/anti-scraping)
- [Crawlee FingerprintOptions](https://crawlee.dev/api/browser-pool/interface/FingerprintOptions)
- [fingerprint-injector npm](https://www.npmjs.com/package/fingerprint-injector)
- [Apify Proxy Docs](https://docs.apify.com/platform/proxy)

## Related

- **Fingerprint patterns**: See `../reference/fingerprint-patterns.md`
- **Proxy configuration**: See Apify docs
- **Session management**: See Crawlee docs

## Summary

**Anti-blocking is essential for scraping strict sites**

**Key steps**:
1. Use fingerprint-suite with Crawlee (easiest)
2. Configure realistic fingerprints (match target device/OS)
3. Add proxies (residential recommended)
4. Enable session rotation
5. Monitor and retire bad sessions
6. Slow down if still blocked

**Remember**: Anti-blocking enables scraping, but doesn't grant permission. Always scrape ethically and legally.

```

### workflows/reconnaissance.md

```markdown
# Phase 1: Interactive Reconnaissance

Critical intelligence gathering before any scraping implementation.

## Why Reconnaissance First?

**The current problem**: Most scraping projects start with guesswork:
- "Try the sitemap" → Maybe it doesn't have all data
- "Scrape HTML" → Slow and brittle
- "Use Playwright" → Overkill if APIs exist

**The reconnaissance solution**:
- Discover hidden APIs visible only in browser DevTools (10-100x faster than HTML)
- Understand site architecture before writing code
- Detect anti-bot measures early and plan countermeasures
- Find optimal data extraction points
- Save hours of wasted implementation effort

## MCP Tools Required

This phase requires:
- **Playwright MCP**: Browser automation for real user interaction
- **Chrome DevTools MCP** (via Playwright): Network monitoring, console analysis

Both are available through Claude's MCP integration when the Playwright MCP server is configured.

---

## Step 1.1: Initialize Browser Session

### Open Target Site

Start by opening the site in a real browser to observe its behavior:

```typescript
// Navigate to target site
await playwright_navigate({
  url: "https://target-site.com",
  headless: false,  // Visual inspection important for first pass
  waitUntil: "networkidle"  // Wait for all network requests
});

// Capture initial state
await playwright_screenshot({
  name: "homepage-initial",
  fullPage: true,
  savePng: true
});
```

### Observe Loading Behavior

**Look for**:
- **Immediate content**: Page loads fully on first request → Likely SSR/static
- **Loading spinners**: Content loads after initial paint → JavaScript-rendered
- **Skeleton screens**: Placeholder UI → API-driven dynamic content
- **Popups/banners**: Cookie consent, newsletters → Need to dismiss before exploration

**Document findings**:
```
Initial Load Observation:
- Page type: [Static/SSR/SPA]
- Loading pattern: [Immediate/Progressive/Delayed]
- Interstitials: [Cookie banner, newsletter popup]
```

---

## Step 1.2: Network Traffic Analysis

### Monitor All Network Requests

**Critical**: This reveals the "invisible" data layer that drives the site.

```typescript
// Start network monitoring (Playwright automatically captures this)
await playwright_navigate({
  url: "https://target-site.com/products"
});

// Navigate through key pages while monitoring
await playwright_click({ selector: ".category-link" });
await playwright_wait({ timeout: 2000 });

await playwright_click({ selector: ".product-item:first-child" });
await playwright_wait({ timeout: 2000 });

// Retrieve console logs (includes network activity logged by site)
const logs = await playwright_console_logs({
  type: "all",
  limit: 100
});
```

### Analyze Network Patterns

Use browser DevTools (manually or via Playwright) to inspect:

**API Endpoints to Look For**:
- `/api/v{N}/...` - Versioned REST APIs
- `/graphql` - GraphQL endpoints
- `/_next/data/...` - Next.js data endpoints
- `/wp-json/...` - WordPress REST API
- `/ajax/...` - Legacy AJAX endpoints
- `/__data.json` - SvelteKit data

**What to Extract**:
```
Discovered Endpoints:
✅ GET /api/v2/products?page={n}&limit={m}
   Request: page=1, limit=20
   Response: JSON array of products
   Auth: None required
   Rate limit: Unknown (test needed)

✅ GET /api/v2/products/{id}
   Response: Detailed product JSON
   Fields: id, name, price, description, images, stock
```

### Inspect Request/Response Details

For each discovered endpoint:

```typescript
// Test API endpoint directly
await playwright_evaluate({
  script: `
    fetch('/api/v2/products?page=1&limit=5')
      .then(r => r.json())
      .then(data => console.log('API_DATA:', JSON.stringify(data)))
      .catch(e => console.log('API_ERROR:', e.message));
  `
});

// Check console for output
const apiLogs = await playwright_console_logs({
  search: "API_DATA",
  limit: 1
});
```

**Document**:
- Request method (GET/POST)
- Required headers (authorization, content-type)
- Query parameters (pagination, filters)
- Response structure
- Authentication requirements

---

## Step 1.3: Site Structure Discovery

### Test Pagination Mechanisms

**Pagination Type Detection**:

```typescript
// Test pagination clicks
await playwright_click({ selector: ".next-page" });
await playwright_wait({ timeout: 1000 });

// Check if URL changed
const currentUrl = await playwright_evaluate({
  script: "window.location.href"
});

// Check if content was replaced or appended
const itemCount = await playwright_evaluate({
  script: "document.querySelectorAll('.product-item').length"
});
```

**Pagination Patterns**:
1. **URL-based**: `?page=2` or `/page/2/`
   - Easy to iterate
   - Can directly construct URLs

2. **API-based**: XHR with `offset`/`cursor` parameters
   - Check DevTools Network tab
   - Extract pagination parameters

3. **Infinite scroll**: Content appends on scroll
   - Need to trigger scroll events
   - Watch for API calls

**Document**:
```
Pagination:
- Type: [URL-based / API-based / Infinite scroll]
- Parameter: page=N or offset=N or cursor=TOKEN
- Items per page: 20
- Total pages: ~250 (estimated from last page)
```
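For URL-based pagination, the full set of page URLs can usually be constructed up front instead of clicking through pages. A minimal sketch, assuming a `?page=N` parameter and a known total (the `buildPageUrls` helper is illustrative, not part of any library):

```javascript
// Build every page URL up front for URL-based pagination (?page=N style).
// totalItems and pageSize come from reconnaissance (e.g. an API response).
function buildPageUrls(baseUrl, totalItems, pageSize) {
  const totalPages = Math.ceil(totalItems / pageSize);
  const urls = [];
  for (let page = 1; page <= totalPages; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set('page', String(page));
    urls.push(url.toString());
  }
  return urls;
}

// 5,000 items at 20 per page → 250 URLs
const pageUrls = buildPageUrls('https://target-site.com/products', 5000, 20);
console.log(pageUrls.length); // 250
console.log(pageUrls[0]);     // https://target-site.com/products?page=1
```

The resulting list can be fed straight into a crawler's request queue, skipping "next page" clicks entirely.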

### Test Filtering and Search

```typescript
// Test filter selection
await playwright_click({ selector: ".filter-category" });
await playwright_wait({ timeout: 1000 });

// Observe URL or API changes
// Check DevTools Network tab for XHR/Fetch

// Test search functionality
await playwright_fill({
  selector: "input[name='search']",
  value: "test query"
});

await playwright_click({ selector: "button[type='submit']" });
```

**Look for**:
- Search API endpoints
- Filter parameters
- Sort options
- Query structure

### Discover Data Loading Patterns

```typescript
// Check for infinite scroll
await playwright_evaluate({
  script: `
    window.scrollTo(0, document.body.scrollHeight);
  `
});

await playwright_wait({ timeout: 2000 });

// Check if new content loaded
const newItemCount = await playwright_evaluate({
  script: "document.querySelectorAll('.product-item').length"
});

// Check console for API calls triggered by scroll
const scrollLogs = await playwright_console_logs({
  search: "api",
  limit: 10
});
```

---

## Step 1.4: Anti-Bot Assessment

### Detect Bot Protection

```typescript
// Check for common bot detection indicators
const protectionCheck = await playwright_evaluate({
  script: `
    const bodyText = document.body.textContent.toLowerCase();
    const html = document.documentElement.outerHTML;

    ({
      cloudflare: bodyText.includes('cloudflare') || html.includes('cf-ray'),
      captcha: bodyText.includes('captcha') || !!document.querySelector('.g-recaptcha, #px-captcha'),
      accessDenied: bodyText.includes('access denied') || bodyText.includes('403 forbidden'),
      rateLimited: bodyText.includes('too many requests') || bodyText.includes('429'),
      akamai: html.includes('akamai'),
      datadome: html.includes('datadome'),
      perimeter: html.includes('perimeterx')
    })
  `
});
```

### Check for Fingerprinting Scripts

```typescript
// Detect fingerprinting libraries
const fingerprintScripts = await playwright_evaluate({
  script: `
    Array.from(document.querySelectorAll('script[src]'))
      .map(s => s.src)
      .filter(src =>
        src.includes('fingerprint') ||
        src.includes('fp-') ||
        src.includes('akamai') ||
        src.includes('datadome') ||
        src.includes('perimeterx') ||
        src.includes('px-')
      );
  `
});
```

### Test Rate Limiting

```typescript
// Make multiple rapid requests to test limits
for (let i = 0; i < 10; i++) {
  await playwright_navigate({
    url: `https://target-site.com/products?page=${i}`
  });

  const blocked = await playwright_evaluate({
    script: `
      document.body.textContent.toLowerCase().includes('rate limit') ||
      document.body.textContent.toLowerCase().includes('too many requests')
    `
  });

  if (blocked) {
    console.log(`Rate limited after ${i + 1} requests`);
    break;
  }
}
```

**Document Protection Mechanisms**:
```
Protection Assessment:
⚠️  Cloudflare: DETECTED (cf-ray header present)
✓   CAPTCHA: Not triggered during normal browsing
✓   Fingerprinting: Not detected
⚠️  Rate Limiting: ~60 requests/minute threshold
✓   Authentication: Not required for product pages

Countermeasures Needed:
- Use residential or datacenter proxies (Cloudflare)
- Respect rate limit: max 50 requests/minute
- Consider fingerprint-suite if blocks occur
- Rotate user agents
```
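Once a rate-limit threshold is known, request pacing reduces to a small delay calculation. A sketch using fixed-interval pacing (`nextDelayMs` is an illustrative name; in Crawlee-based scrapers, prefer the built-in `maxRequestsPerMinute` option):

```javascript
// How long to wait before the next request so the observed rate stays
// at or under `perMinute` requests per minute (fixed-interval pacing).
function nextDelayMs(lastRequestAt, now, perMinute) {
  const intervalMs = 60_000 / perMinute;
  const earliest = lastRequestAt + intervalMs;
  return Math.max(0, earliest - now);
}

// At 50 requests/minute the spacing is 1,200 ms between requests:
console.log(nextDelayMs(0, 0, 50));    // 1200
console.log(nextDelayMs(0, 700, 50));  // 500
console.log(nextDelayMs(0, 2000, 50)); // 0
```

In a hand-rolled API loop, sleep for `nextDelayMs(last, Date.now(), 50)` before each request and update `last` afterwards.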

---

## Step 1.5: Generate Intelligence Report

### Compile Findings

Create structured report with all reconnaissance data:

```markdown
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔍 INTELLIGENCE REPORT: example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Generated: 2025-01-24 10:30 UTC

## 1. Site Architecture

**Framework**: Next.js 13 (detected __NEXT_DATA__)
**Rendering**: Hybrid SSR + CSR
**Primary Data Source**: Internal REST API

## 2. Discovered Endpoints

### Products API (Primary)
**Endpoint**: `GET /api/v2/products`
**Parameters**:
  - `page`: integer (1-250)
  - `limit`: integer (max 100, default 20)
  - `category`: string (optional filter)

**Response Structure**:
```json
{
  "products": [...],
  "total": 5000,
  "page": 1,
  "hasMore": true
}
```

**Authentication**: None required
**Rate Limit**: ~60 requests/minute

### Product Details API
**Endpoint**: `GET /api/v2/products/{id}`
**Response**: Full product object with variants
**Authentication**: None required

## 3. Sitemap Analysis

**Location**: `/sitemap_index.xml`
**Product URLs**: 5,000 URLs
**Update Frequency**: Daily
**Coverage**: 100% (matches API total)

## 4. Protection Mechanisms

| Mechanism | Status | Impact |
|-----------|--------|--------|
| Cloudflare | ✅ Active | Medium - use proxies |
| CAPTCHA | ⚪ Not triggered | None currently |
| Fingerprinting | ⚪ Not detected | None |
| Rate Limiting | ⚠️ 60/min | High - must respect |
| Auth Required | ❌ None | None |

## 5. Pagination Strategy

**Type**: API-based offset pagination
**Method**: Query parameters `?page=N&limit=M`
**Max per request**: 100 items
**Total pages**: 50 (at limit=100)
**Estimated total items**: 5,000

## 6. Optimal Scraping Strategy

### Recommended Approach: Hybrid (Sitemap + API)

**Phase A - URL Discovery** (~1 minute):
1. Parse sitemap.xml → Extract 5,000 product URLs
2. Extract product IDs from URLs: `/products/(\d+)`

**Phase B - Data Extraction** (~1-2 minutes):
1. Use discovered API: `/api/v2/products?page=1&limit=100`
2. Rate: 50 requests/minute (safe buffer)
3. Total requests needed: 50
4. Expected duration: ~1 minute at that rate
5. Use datacenter proxies (Cloudflare present)

**Why This Works**:
- ✅ API is 10-100x faster than HTML scraping
- ✅ No HTML parsing needed (clean JSON)
- ✅ Sitemap validates completeness
- ✅ Under rate limit
- ✅ No authentication barriers

### Alternative: Direct Sitemap Scraping
If API becomes blocked:
- Scrape HTML from sitemap URLs
- Use project_playwright_crawler_ts template
- Enable proxies + fingerprint-suite
- Estimated time: 15-20 minutes (slower)

## 7. Implementation Checklist

- [ ] Use `gotScraping` or `fetch` for API calls
- [ ] Implement rate limiting: 50 requests/minute
- [ ] Use Apify proxy (datacenter tier minimum)
- [ ] Parse sitemap for product IDs
- [ ] Batch API requests (100 items per call)
- [ ] Add retry logic (3 attempts)
- [ ] Log failed requests
- [ ] Validate data completeness

## 8. Risk Assessment

**Low Risk** ✅:
- API is stable and publicly accessible
- No authentication required
- Rate limits are reasonable
- Cloudflare not aggressive (with proxies)

**Potential Issues** ⚠️:
- API structure may change (monitor for schema changes)
- Rate limit may tighten (respect current limits)
- Cloudflare may upgrade protection (have Playwright fallback ready)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PROCEED TO PHASE 2: VALIDATION & PHASE 3: STRATEGY SELECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
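The ID-extraction step in Phase A of the report can be sketched as follows (the `/products/<id>` URL shape and the regex are the assumptions stated in the report; `extractProductIds` is an illustrative helper):

```javascript
// Pull numeric product IDs out of sitemap URLs shaped like /products/<id>.
function extractProductIds(urls) {
  const ids = [];
  for (const url of urls) {
    const match = url.match(/\/products\/(\d+)/);
    if (match) ids.push(match[1]);
  }
  return ids;
}

const ids = extractProductIds([
  'https://example.com/products/101',
  'https://example.com/products/202?ref=sitemap',
  'https://example.com/about', // no ID → skipped
]);
console.log(ids); // ['101', '202']
```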

---

## Real-World Examples

### Example 1: E-Commerce Site with Hidden API

**User Request**: "Scrape products from shop.example.com"

**Reconnaissance Process**:

1. **Browser Session**:
```typescript
await playwright_navigate({ url: "https://shop.example.com" });
await playwright_screenshot({ name: "homepage" });
```

2. **Network Analysis**:
   - Clicked through categories
   - Observed XHR requests in DevTools
   - **Discovery**: `GET /api/products.json?collection_id=123`

3. **API Testing**:
```typescript
await playwright_evaluate({
  script: `
    fetch('/api/products.json?collection_id=123&limit=100')
      .then(r => r.json())
      .then(d => console.log('FOUND:', d.products.length, 'products'));
  `
});
```
   - Result: API returns 100 products per request, no auth needed

4. **Protection Check**:
   - No Cloudflare detected
   - No rate limiting after 20 test requests
   - Simple API, no fingerprinting

**Outcome**: Skip HTML scraping entirely, use direct API access (50x faster)

---

### Example 2: News Site with Infinite Scroll

**User Request**: "Scrape articles from news.example.com"

**Reconnaissance**:

1. **Initial Load**: Only 10 articles visible

2. **Scroll Test**:
```typescript
await playwright_evaluate({
  script: "window.scrollTo(0, document.body.scrollHeight)"
});
await playwright_wait({ timeout: 2000 });
```

3. **Network Observation**:
   - XHR triggered: `GET /api/articles?offset=10`
   - Pattern discovered: offset-based pagination

4. **Pagination Extraction**:
   - Offset increases by 10
   - Total articles: 500 (from API response header)
   - API accessible directly without browser

**Outcome**: Use API with offset pagination instead of Playwright scroll automation
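The offset sequence discovered in this example can be generated directly, so the API can be called without any browser scrolling. A minimal sketch (`offsetsFor` is an illustrative helper; the totals come from the example above):

```javascript
// Enumerate offsets for offset-based pagination: 0, 10, 20, ... < total.
function offsetsFor(total, pageSize) {
  const offsets = [];
  for (let offset = 0; offset < total; offset += pageSize) {
    offsets.push(offset);
  }
  return offsets;
}

// 500 articles at 10 per request → 50 API calls
const offsets = offsetsFor(500, 10);
console.log(offsets.length);             // 50
console.log(offsets[0], offsets.at(-1)); // 0 490
```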

---

### Example 3: Protected Site with Cloudflare

**User Request**: "Scrape data from protected.example.com"

**Reconnaissance**:

1. **Initial Load**: Cloudflare challenge page

2. **Protection Analysis**:
```typescript
const protection = await playwright_evaluate({
  script: `
    document.body.textContent.includes('Checking your browser')
  `
});
// Result: true - Cloudflare active
```

3. **Challenge Solving**: Wait for automatic challenge resolution
```typescript
await playwright_wait({ timeout: 5000 });
await playwright_screenshot({ name: "after-challenge" });
```

4. **Post-Challenge Analysis**:
   - Cookies set: `cf_clearance`, `__cfduid`
   - All subsequent requests require these cookies
   - Standard HTML scraping, no API found

**Outcome**:
- Must use Playwright (browser needed for challenge)
- Use fingerprint-suite for stealth
- Use residential proxies
- Implement cookie persistence

---

## Decision Tree

Based on reconnaissance findings, determine next steps:

```
Reconnaissance Complete
    ├─ API Discovered?
    │   ├─ YES → Prefer API route (Phase 3: API strategy)
    │   │         └─ Check: Auth required?
    │   │             ├─ NO → Direct API access ✅ (fastest)
    │   │             └─ YES → Browser auth + API extraction
    │   └─ NO → HTML scraping needed
    │             └─ Check: JavaScript-rendered?
    │                 ├─ YES → Use Playwright
    │                 └─ NO → Use Cheerio (10x faster)
    │
    ├─ Protection Detected?
    │   ├─ Cloudflare/bot detection
    │   │   └─ Add: Proxies + Fingerprint-suite
    │   ├─ Rate limiting
    │   │   └─ Add: Rate limiter (respect limits)
    │   └─ CAPTCHA
    │       └─ Consider: CAPTCHA solving service or manual intervention
    │
    └─ Sitemap Available?
        ├─ YES → Use for URL discovery
        └─ NO → Implement crawler for discovery
```

---

## Common Mistakes to Avoid

### ❌ Skipping Reconnaissance

**Bad**: Jump straight to coding based on assumptions
```javascript
// WRONG: Assuming structure
const crawler = new PlaywrightCrawler({
  async requestHandler({ page }) {
    // Blindly scraping HTML without knowing if API exists
  }
});
```

**Good**: Perform reconnaissance first
```javascript
// RIGHT: After discovering API in Phase 1
import { gotScraping } from 'got-scraping';

const response = await gotScraping({
  url: discoveredApiEndpoint,  // From reconnaissance
  responseType: 'json',
});
```

### ❌ Ignoring Network Tab

**Bad**: Only looking at visible HTML
**Good**: Monitor DevTools Network tab for hidden APIs

### ❌ Testing Without Protection Awareness

**Bad**: Write scraper, then discover Cloudflare blocks it
**Good**: Detect Cloudflare in Phase 1, plan proxies from start

---

## Tools Reference

### MCP Tool Usage

**Playwright MCP commands used in reconnaissance**:
- `playwright_navigate` - Open URLs
- `playwright_screenshot` - Capture visual state
- `playwright_click` - Interact with elements
- `playwright_fill` - Test form inputs
- `playwright_evaluate` - Execute JavaScript, test APIs
- `playwright_console_logs` - Retrieve network/error logs
- `playwright_wait` - Pause for async operations

**Not yet available but useful**:
- Chrome DevTools Protocol directly (use `playwright_evaluate` workaround)
- Network HAR export (use console logging workaround)

---

## Next Steps

After completing reconnaissance:

1. **Proceed to Phase 2**: Validate findings with automated checks (sitemaps, robots.txt)
2. **Proceed to Phase 3**: Present strategy recommendations based on intelligence
3. **Proceed to Phase 4**: Implement chosen strategy with confidence

**Key Advantage**: No more guesswork - every implementation decision is backed by reconnaissance data.

---

Back to main workflow: `../SKILL.md`

```

### workflows/implementation.md

```markdown
# Phase 3: Iterative Implementation

Patterns for implementing scrapers incrementally, starting simple and adding complexity only as needed.

## Step 1: Implement Recommended Approach

### Progressive Enhancement Pattern

1. Start with minimal working code
2. Test with small sample (5-10 items)
3. Validate data quality
4. Scale to full dataset

### Reference Implementation Patterns

- **Sitemap**: See `../strategies/sitemap-discovery.md`
- **API**: See `../strategies/api-discovery.md`
- **Playwright**: See `../strategies/playwright-scraping.md`
- **Examples**: See `../examples/` directory

## Step 2: Test Small Batch First

```javascript
// Example: Test with first 10 URLs
import { RobotsFile } from 'crawlee';

const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();
const testUrls = urls.slice(0, 10);

console.log(`Testing with ${testUrls.length} URLs first...`);
// Implement scraping logic
// Validate output quality
```

### Validation Checklist

- ✓ Data structure correct?
- ✓ All fields populated?
- ✓ Any errors or null values?
- ✓ Performance acceptable?

## Step 3: Scale or Fallback

### If Test Succeeds

```javascript
console.log('✓ Test successful, scaling to full dataset...');
await crawler.addRequests(urls); // All URLs
await crawler.run();
```

### If Test Fails

```javascript
console.log('✗ Issues detected, falling back to alternative strategy...');
// Try next approach from recommendations
```

## Step 4: Handle Blocking (If Encountered)

### Identify Blocking Type

- **Rate limiting** → Slow down requests (`maxRequestsPerMinute`)
- **IP blocking** → Use proxies
- **Bot detection** → Use fingerprinting + proxies
- **Cloudflare/CAPTCHA** → Advanced techniques

### Apply Anti-Blocking

See `../strategies/anti-blocking.md` for complete guide.

```typescript
const crawler = new PlaywrightCrawler({
    // Enable sessions + browser fingerprinting
    useSessionPool: true,
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                devices: ['desktop'],
                operatingSystems: ['windows', 'macos'],
                browsers: ['chrome'],
            },
        },
    },

    // Add proxies
    proxyConfiguration: await Actor.createProxyConfiguration({
        groups: ['RESIDENTIAL'],
    }),

    // Slow down
    maxConcurrency: 3,
    maxRequestsPerMinute: 30,
```

### Test Incrementally

1. Start with fingerprinting only
2. Add datacenter proxies if still blocked
3. Upgrade to residential proxies if needed
4. Add session rotation

## Step 5: Add Robustness

### Error Handling Pattern

```javascript
const crawler = new PlaywrightCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ page, request, log }) {
        try {
            // Scraping logic
        } catch (error) {
            log.error(`Failed to scrape ${request.url}: ${error.message}`);
            throw error; // Retry
        }
    },

    failedRequestHandler({ request, log }, error) {
        log.error(`Request failed after retries: ${request.url} (${error.message})`);
    },
});
```

### Enhancements to Add

- Error handling (try/catch)
- Retries with exponential backoff
- Progress logging
- Data validation
- Rate limiting respect
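
The retry-with-backoff enhancement can be sketched as a small delay calculator (names and constants are illustrative; Crawlee's `maxRequestRetries` already handles retries, so this applies mainly to hand-rolled API loops):

```javascript
// Exponential backoff with jitter: base * 2^attempt, capped at maxMs,
// plus up to 20% random jitter to avoid synchronized retries.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30_000) {
  const exp = Math.min(maxMs, baseMs * 2 ** attempt);
  const jitter = Math.random() * 0.2 * exp;
  return exp + jitter;
}

// attempt 0 → 1000-1200 ms, attempt 3 → 8000-9600 ms,
// attempt 10 → capped at 30000-36000 ms
```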

---

Back to main workflow: `../SKILL.md`

```

### workflows/productionization.md

```markdown
# Phase 4: Productionization (Apify Actor Creation)

Patterns for converting scrapers into production-ready Apify Actors.

## Activation Triggers

Load this workflow when user requests:
- "Make this an Apify Actor"
- "Productionize this scraper"
- "Deploy to Apify"
- "Create an actor from this"

## Step 1: Confirm TypeScript Preference

```
For production Actors, TypeScript is STRONGLY RECOMMENDED:

Benefits:
✓ Type safety (catch errors at compile time)
✓ IDE autocomplete for Apify/Crawlee APIs
✓ Better refactoring support
✓ Self-documenting code
✓ Industry standard for production code

Use TypeScript for this Actor? [Y/n]
```

## Step 2: Select Appropriate Template

Based on Phase 1 site analysis, choose the optimal template:

### Decision Tree

**1. Analyze site characteristics** (from Phase 1 reconnaissance):
   - Static HTML / Server-Side Rendering → Use Cheerio
   - JavaScript-rendered content → Use Playwright
   - Anti-bot challenges → Consider Camoufox variant

**2. Template recommendations**:

#### Option A: `project_cheerio_crawler_ts` (Recommended for most cases)
**Use when:**
- Site serves static HTML or SSR content
- No JavaScript execution needed
- Speed and efficiency are priorities (~10x faster than Playwright)
- Simple scraping without complex interactions

**Benefits:**
- Fastest option (raw HTTP requests)
- Lower resource usage
- Perfect for: blogs, news sites, e-commerce product pages (non-SPA)

#### Option B: `project_playwright_crawler_ts`
**Use when:**
- JavaScript frameworks (React, Vue, Angular, Next.js)
- Dynamic content loading (infinite scroll, lazy loading)
- Need browser interactions (clicking, scrolling, forms)
- Anti-scraping measures present

**Benefits:**
- Full browser automation
- Handles complex JavaScript
- Better for modern SPAs

#### Option C: `project_playwright_camoufox_crawler_ts` (Advanced)
**Use when:**
- Facing serious anti-bot challenges
- Standard Playwright is being blocked
- Need stealth browser fingerprinting

**Note**: Mentioned in `../strategies/anti-blocking.md`

### Easy Migration
Switching from CheerioCrawler to PlaywrightCrawler requires minimal changes:
- Change the import: `CheerioCrawler` → `PlaywrightCrawler`
- Move extraction from Cheerio's `$('selector')` to `page.locator()` or `page.evaluate()`
- Core crawling logic (request queue, handlers, dataset output) remains identical

### Hybrid Approach (Advanced)
```typescript
// Try Cheerio first for speed
const cheerioCrawler = new CheerioCrawler({ /* ... */ });

// Fallback to Playwright for failed requests
const playwrightCrawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request }) => {
        // Re-process URLs that failed in the Cheerio pass (e.g. JS-rendered pages)
    },
});
```

## Step 3: Initialize with Apify CLI

**CRITICAL**: Always use `apify create` command

```bash
# Install CLI if needed
npm install -g apify-cli

# ALWAYS use apify create (not manual setup)
apify create my-scraper

# When prompted, select the appropriate template:
# → project_cheerio_crawler_ts (static HTML, fastest)
# → project_playwright_crawler_ts (JavaScript-heavy sites)
# → project_playwright_camoufox_crawler_ts (anti-bot challenges)
```

### Why This Is Critical

- Auto-generates proper structure
- Includes ESLint, TypeScript config
- Creates .actor/ directory correctly
- Sets up npm scripts
- Adds proper Dockerfile

See `../apify/cli-workflow.md` for complete guide.

## Step 4: Port Scraping Logic

### Conversion Checklist

1. Wrap in `Actor.main()`
2. Add type definitions (if TypeScript)
3. Configure input schema
4. Set up dataset output
5. Add error handling

**Reference AGENTS.md**: After running `apify create`, the template includes `AGENTS.md` with detailed guidance on:
- Input/output schema specifications
- Dataset and key-value store patterns
- Do/Don't best practices for Actor development
- SDK usage patterns

See `../apify/templates/` for complete templates and `../apify/agents-md-guide.md` for how AGENTS.md complements this skill.

## Step 5: Test & Deploy

```bash
# Test locally
apify run

# Build (for TypeScript)
npm run build

# Deploy to platform
apify push
```

See `../apify/deployment.md` for full deployment guide.

## Quick Apify Reference

| Task | Command/Pattern | Documentation |
|------|----------------|---------------|
| Create Actor | `apify create` | `../apify/cli-workflow.md` |
| Template selection | Decision tree above | This guide |
| Input schema | `.actor/input_schema.json` | `../apify/input-schemas.md` |
| Configuration | `.actor/actor.json` | `../apify/configuration.md` |
| Deploy actor | `apify push` | `../apify/deployment.md` |

## Complete Apify Module

See `../apify/` directory for:
- **Core Guides**: TypeScript-first, CLI workflow, initialization, input schemas, configuration, deployment
- **Templates**: Complete TypeScript actor template
- **Examples**: 3 production-ready actor examples

---

Back to main workflow: `../SKILL.md`

```

### strategies/sitemap-discovery.md

```markdown
# Sitemap-Based URL Discovery

## Overview

Sitemaps are XML files that list all URLs on a website, providing the fastest and most efficient way to discover pages to scrape. Instead of crawling page-by-page, you can get all URLs instantly.

## When to Use Sitemaps

### ✅ USE sitemaps when:
- Website has a sitemap (check `/sitemap.xml` or `robots.txt`)
- Large websites with 100+ pages
- Product catalogs, blogs, news sites
- Need complete site coverage
- URLs follow predictable patterns
- E-commerce sites (products, categories)
- Time-sensitive scraping (need fast results)

### ❌ DON'T use sitemaps when:
- Site doesn't have a sitemap
- Single-page applications with dynamic content
- Need to follow user flows (login, navigation, shopping cart)
- Sitemap is outdated or incomplete
- Crawling logic depends on page content
- Site uses heavy JavaScript for navigation

## Finding Sitemaps

Sitemaps are typically found at these locations:

```
https://example.com/sitemap.xml              ← Most common
https://example.com/robots.txt               ← Lists sitemap URLs
https://example.com/sitemap_index.xml        ← Sitemap of sitemaps
https://example.com/product-sitemap.xml      ← Product-specific
https://example.com/sitemap.xml.gz           ← Compressed
https://example.com/sitemaps/sitemap.xml     ← In subdirectory
```

### Always Check robots.txt First

```bash
curl https://example.com/robots.txt
```

Example robots.txt:
```
User-agent: *
Sitemap: https://example.com/sitemap_index.xml
Sitemap: https://example.com/products-sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
```
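
Extracting the `Sitemap:` directives from a robots.txt body is a one-line regex (the helper name is illustrative; in Crawlee, `RobotsFile.find()` handles this automatically):

```javascript
// Pull all Sitemap: directives out of a robots.txt body (case-insensitive).
function sitemapsFromRobots(robotsTxt) {
  return [...robotsTxt.matchAll(/^sitemap:\s*(\S+)/gim)].map((m) => m[1]);
}

const robotsTxt = [
  'User-agent: *',
  'Sitemap: https://example.com/sitemap_index.xml',
  'Sitemap: https://example.com/products-sitemap.xml',
].join('\n');

console.log(sitemapsFromRobots(robotsTxt));
// ['https://example.com/sitemap_index.xml', 'https://example.com/products-sitemap.xml']
```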

## Implementation Patterns

### Pattern 1: Automatic Discovery (Recommended)

Use `RobotsFile` to automatically find and parse all sitemaps:

```javascript
import { PlaywrightCrawler, RobotsFile, Dataset } from 'crawlee';

// Automatically finds robots.txt and parses ALL sitemaps
const robots = await RobotsFile.find('https://example.com');
const allUrls = await robots.parseUrlsFromSitemaps();

console.log(`Found ${allUrls.length} URLs from sitemaps`);

// Create crawler
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        log.info(`Scraping: ${request.url}`);

        const data = await page.evaluate(() => ({
            title: document.title,
            price: document.querySelector('.price')?.textContent,
            description: document.querySelector('.description')?.textContent,
        }));

        await Dataset.pushData(data);
    },
});

// Add all sitemap URLs
await crawler.addRequests(allUrls);
await crawler.run();
```

**Benefits**:
- Handles sitemap indexes (nested sitemaps)
- Handles compressed sitemaps (.gz)
- Respects robots.txt rules
- No need to know sitemap URLs upfront

### Pattern 2: Filtered URLs with Regex

Use `RequestList` to filter only specific URL patterns:

```javascript
import { PlaywrightCrawler, RequestList, Dataset } from 'crawlee';

// Load sitemap and filter URLs with regex
const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://shop.com/sitemap.xml',
    // Only product pages (not categories, help pages, etc.)
    regex: /\/products\/[a-z0-9-]+$/i,
}]);

const crawler = new PlaywrightCrawler({
    requestList,
    async requestHandler({ page, request, log }) {
        log.info(`Scraping product: ${request.url}`);

        const product = await page.evaluate(() => ({
            name: document.querySelector('h1')?.textContent,
            price: document.querySelector('[data-testid="price"]')?.textContent,
            sku: document.querySelector('[data-sku]')?.dataset.sku,
        }));

        await Dataset.pushData(product);
    },
});

await crawler.run();
```

**Common Regex Patterns**:

```javascript
// Products only
regex: /\/products\/[a-z0-9-]+$/i

// Blog posts (with date pattern)
regex: /\/blog\/\d{4}\/\d{2}\/[a-z0-9-]+/i

// Exclude categories, only products
regex: /\/products\/[^/<]+$/

// Multiple patterns (products OR deals)
regex: /(\/products\/[^/<]+|\/deals\/[^/<]+)/

// Specific category
regex: /\/products\/electronics\/[^/<]+$/
```
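
Before running a full crawl, it is worth sanity-checking the chosen pattern against a handful of sample URLs (the sample URLs here are made up):

```javascript
// Verify the product-page regex keeps product URLs and drops everything else.
const productPattern = /\/products\/[a-z0-9-]+$/i;

const sample = [
  'https://shop.com/products/blue-widget', // product → kept
  'https://shop.com/products/',            // category index → dropped
  'https://shop.com/help/returns',         // non-product → dropped
];

const matches = sample.filter((url) => productPattern.test(url));
console.log(matches); // ['https://shop.com/products/blue-widget']
```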

See `../reference/regex-patterns.md` for more patterns.

### Pattern 3: Multiple Specific Sitemaps

Load specific sitemap URLs directly:

```javascript
import { PlaywrightCrawler, Sitemap, Dataset } from 'crawlee';

// Load multiple sitemaps
const sitemap = await Sitemap.load([
    'https://example.com/product-sitemap.xml',
    'https://example.com/blog-sitemap.xml.gz', // Handles .gz automatically
]);

console.log(`Found ${sitemap.urls.length} URLs`);

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        // Handle both products and blog posts
        if (request.url.includes('/products/')) {
            // Scrape product
            const product = await page.evaluate(() => ({
                name: document.querySelector('h1')?.textContent,
                price: document.querySelector('.price')?.textContent,
            }));
            await Dataset.pushData({ type: 'product', ...product });
        } else if (request.url.includes('/blog/')) {
            // Scrape blog post
            const post = await page.evaluate(() => ({
                title: document.querySelector('h1')?.textContent,
                content: document.querySelector('.content')?.textContent,
            }));
            await Dataset.pushData({ type: 'post', ...post });
        }
    },
});

await crawler.addRequests(sitemap.urls);
await crawler.run();
```

### Pattern 4: Hybrid (Sitemap + Crawling)

Start with sitemap, then also crawl discovered links:

```javascript
import { PlaywrightCrawler, RobotsFile, Dataset } from 'crawlee';

// Start with sitemap URLs
const robots = await RobotsFile.find('https://example.com');
const sitemapUrls = await robots.parseUrlsFromSitemaps();

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 5000,
    async requestHandler({ page, enqueueLinks, request, log }) {
        log.info(`Processing: ${request.url}`);

        // Scrape data
        const data = await page.evaluate(() => ({
            title: document.title,
            links: Array.from(document.querySelectorAll('a')).map(a => a.href),
        }));

        await Dataset.pushData(data);

        // ALSO crawl discovered links (optional)
        await enqueueLinks({
            selector: 'a[href*="/products/"]',
            strategy: 'same-domain',
        });
    },
});

// Start with all sitemap URLs
await crawler.addRequests(sitemapUrls);
await crawler.run();
```

## URL Filtering Techniques

### Using lastmod Dates

Filter URLs by last modification date:

```javascript
import { Sitemap } from 'crawlee';

const sitemap = await Sitemap.load(['https://site.com/sitemap.xml']);

// Filter to recently updated URLs (last 30 days)
const recentUrls = sitemap.urls.filter(urlObj => {
    const lastMod = new Date(urlObj.lastmod);
    const monthAgo = new Date(Date.now() - 30 * 24 * 60 * 60 * 1000);
    return lastMod > monthAgo;
}).map(urlObj => urlObj.loc);

console.log(`Found ${recentUrls.length} recently updated URLs`);
```

### Using Priority

Filter by sitemap priority (0.0 to 1.0):

```javascript
// Get only high-priority pages
const highPriorityUrls = sitemap.urls.filter(urlObj => {
    return parseFloat(urlObj.priority) >= 0.8;
}).map(urlObj => urlObj.loc);
```

## Error Handling

Always handle cases where sitemaps might not exist or be malformed:

```javascript
import { PlaywrightCrawler, RobotsFile, Dataset } from 'crawlee';

try {
    // Try to find and parse sitemaps
    const robots = await RobotsFile.find('https://example.com');
    const urls = await robots.parseUrlsFromSitemaps();

    const crawler = new PlaywrightCrawler({
        async requestHandler({ page, request, log }) {
            // Scrape logic
        },
        failedRequestHandler({ request, log }, error) {
            log.error(`Request ${request.url} failed: ${error.message}`);
        },
    });

    if (urls.length === 0) {
        console.log('⚠ No URLs found in sitemaps, falling back to crawling');
        // Fall back to traditional crawling from the homepage
        await crawler.run(['https://example.com']);
    } else {
        console.log(`✓ Found ${urls.length} URLs, starting scrape`);
        await crawler.addRequests(urls);
        await crawler.run();
    }
} catch (error) {
    console.error(`✗ Sitemap discovery failed: ${error.message}`);
    console.log('Falling back to traditional crawling');
    // Implement fallback strategy
}
```

## Best Practices

### ✅ DO:

- **Check robots.txt first** for sitemap locations
- **Use `RobotsFile.find()`** for automatic discovery
- **Filter URLs with regex** when you only need specific page types
- **Verify sitemap is current** before relying on it (check lastmod dates)
- **Use `lastmod` dates** to avoid re-scraping unchanged content
- **Handle compressed sitemaps** (.gz files) - Crawlee does this automatically
- **Combine with crawling** for completeness if needed
- **Test sitemap URLs** before running full scrape (sample 5-10 first)
- **Log progress** clearly (URLs found, filtered, scraped)

### ❌ DON'T:

- **Assume all sites have sitemaps** - always have a fallback
- **Trust sitemaps to be complete** - some pages may be missing
- **Use sitemaps for dynamic/SPA content** - crawling is better
- **Forget to filter URLs** - sitemaps often include pages you don't need
- **Ignore robots.txt rules** - respect crawl directives
- **Scrape login-protected pages** from sitemaps - won't work
- **Skip error handling** - some sitemap URLs may be broken
- **Ignore rate limits** - even with sitemaps, respect robots.txt crawl-delay
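
The "sample 5-10 first" advice above is easy to automate. A minimal sketch of an evenly spaced sampler (a plain helper, no library assumed) — spreading the sample across the list catches more page types than just taking the first few URLs:

```javascript
// Sketch: pick an evenly spaced sample of sitemap URLs for a dry run.
// Evenly spaced (rather than the first N) is more likely to hit
// different page templates across a large sitemap.
function sampleUrls(urls, n = 10) {
    if (urls.length <= n) return [...urls];
    const step = urls.length / n;
    return Array.from({ length: n }, (_, i) => urls[Math.floor(i * step)]);
}
```

Run the scraper against `sampleUrls(urls)` first, verify the extracted data looks right, then launch the full crawl.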

## Performance Comparison

| Metric | Sitemap | Traditional Crawling | Improvement |
|--------|---------|----------------------|-------------|
| **URL Discovery** | 5-10 seconds | 5-10 minutes | ⚡ 60x faster |
| **Bandwidth** | ~2 MB | ~200 MB | 💾 100x less |
| **Coverage** | 100% (if current) | 80-90% | ✅ Better |
| **Time to First Data** | 10-20 seconds | 5-10 minutes | ⏱️ 30x faster |

## Troubleshooting

### Problem: No URLs found in sitemap

**Solutions**:
```javascript
// 1. Check if sitemap exists manually
const response = await fetch('https://example.com/sitemap.xml');
if (!response.ok) {
    console.log('No sitemap found at /sitemap.xml');
}

// 2. Check robots.txt
const robotsResponse = await fetch('https://example.com/robots.txt');
const robotsText = await robotsResponse.text();
console.log('Sitemap directives:', robotsText.match(/Sitemap:.+/gi));

// 3. Fall back to crawling
console.log('Falling back to traditional crawling');
```

### Problem: Sitemap has too many irrelevant URLs

**Solution**: Use regex filtering
```javascript
const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://site.com/sitemap.xml',
    regex: /\/products\/[^/<]+$/, // Only product pages
}]);
```

### Problem: Sitemap URLs return 404

**Solution**: Add error handling
```javascript
const crawler = new PlaywrightCrawler({
    failedRequestHandler({ request, log }, error) {
        log.warning(`URL from sitemap returned error: ${request.url} (${error.message})`);
        // Don't crash, just log and continue
    },
});
```

## Related Resources

- **Regex patterns**: See `../reference/regex-patterns.md`
- **Hybrid approaches**: See `hybrid-approaches.md`
- **API discovery**: See `api-discovery.md` (often better than scraping)
- **Examples**: See `../examples/sitemap-basic.js`

## Summary

**Sitemaps are the FASTEST way to discover URLs** - use them whenever possible!

**Key takeaways**:
1. Always check for sitemaps first (60x faster than crawling)
2. Use `RobotsFile.find()` for automatic discovery
3. Filter with regex to get only relevant URLs
4. Always have a fallback to crawling
5. Combine with API discovery for best results

```

### reference/regex-patterns.md

```markdown
# Common Regex Patterns for URL Filtering

Quick reference for filtering sitemap URLs with regex.

## Product Pages

```javascript
// Basic product pattern
/\/products\/[a-z0-9-]+$/i

// Product with numeric ID
/\/products\/(\d+)/

// Product with slug
/\/products\/([a-z0-9-]+)$/i

// Exclude category pages
/\/products\/[^\/]+$/
// Matches: /products/shoe-123
// Skips: /products/shoes/running

// Specific category
/\/products\/electronics\/[^\/]+$/
```

## Blog Posts

```javascript
// Blog with date
/\/blog\/\d{4}\/\d{2}\/[a-z0-9-]+/i
// Matches: /blog/2025/10/my-post

// Blog without date
/\/blog\/[a-z0-9-]+$/i

// WordPress pattern
/\/\d{4}\/\d{2}\/[a-z0-9-]+/
```

## Multiple Patterns

```javascript
// Products OR deals
/(\/products\/[^\/]+|\/deals\/[^\/]+)/

// Multiple categories
/\/(electronics|clothing|books)\/[^\/]+$/
```

## Exclude Patterns

```javascript
// Exclude pages
/^(?!.*(about|contact|help)).*$/

// Exclude file extensions
/^(?!.*\.(pdf|jpg|png)).*$/
```

## Usage with RequestList

```javascript
import { RequestList } from 'crawlee';

const requestList = await RequestList.open(null, [{
    requestsFromUrl: 'https://site.com/sitemap.xml',
    regex: /\/products\/[^\/]+$/,
}]);
```

## Testing Patterns

Test your regex before running:

```javascript
const pattern = /\/products\/[^\/]+$/;
const urls = [
    'https://shop.com/products/shoe-123',      // ✓ Match
    'https://shop.com/products/shoes/running', // ✗ No match (has /)
    'https://shop.com/products',               // ✗ No match (no product)
];

urls.forEach(url => {
    console.log(`${url}: ${pattern.test(url) ? '✓' : '✗'}`);
});
```

```

### strategies/api-discovery.md

```markdown
# API Discovery and Usage

## Overview

Many websites expose hidden APIs that are **10-100x faster and more reliable** than scraping HTML. Always look for APIs before writing scraping code!

## Why APIs are Better Than Scraping

| Aspect | API | HTML Scraping |
|--------|-----|---------------|
| **Speed** | Very fast (JSON responses) | Slow (render full pages) |
| **Reliability** | Stable structure | Breaks when HTML changes |
| **Data Quality** | Clean, structured JSON | Messy, requires parsing |
| **Bandwidth** | Low (only data) | High (images, CSS, JS) |
| **Maintenance** | Low (stable contracts) | High (fragile selectors) |
| **Rate Limiting** | Clear limits | Ambiguous |

**Example**:
- Scraping HTML: Load entire page (~500 KB), parse HTML, extract data
- Using API: GET `/api/product/123` returns clean JSON (~5 KB)

**Result**: 100x less bandwidth, 10x faster, 0 HTML parsing

## How to Find APIs

### Step 1: Open Browser DevTools

1. Open target website in browser
2. Press `F12` or `Ctrl+Shift+I` (Windows/Linux) or `Cmd+Option+I` (Mac)
3. Go to **Network** tab
4. Filter by **XHR** or **Fetch**

### Step 2: Navigate the Website

Interact with the website normally:
- Browse products
- Search for items
- Click pagination
- Load more content
- Submit forms

Watch the Network tab for API requests!

### Step 3: Identify API Patterns

Look for requests to:
```
/api/...
/v1/...
/v2/...
/graphql
/_next/data/...
/wp-json/...
/rest/...
```
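
The path patterns above can double as a quick probe list. A sketch (the path set is an assumption taken from the list — extend it with whatever you actually see in the Network tab):

```javascript
// Sketch: build candidate API endpoints to probe for a given origin.
const COMMON_API_PATHS = [
    '/api', '/v1', '/v2', '/graphql',
    '/wp-json', '/rest', '/products.json',
];

function apiCandidates(origin) {
    // new URL resolves each path against the origin
    return COMMON_API_PATHS.map(path => new URL(path, origin).href);
}
```

Request each candidate and keep the ones that answer with a JSON `Content-Type` instead of an HTML error page.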

### Step 4: Analyze Requests

For each promising request, check:
- **URL pattern**: Can you construct similar URLs?
- **Method**: GET, POST, etc.
- **Headers**: Authentication? Content-Type?
- **Query parameters**: Pagination? Filters?
- **Response format**: JSON? GraphQL? XML?

## Common API Patterns

### REST APIs

**Pattern**: `https://api.example.com/v1/resources/{id}`

**Example**:
```
GET https://shop.com/api/products/12345
GET https://shop.com/api/products?category=electronics&limit=50
```

**How to use**:
```javascript
import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://shop.com/api/products/12345',
    responseType: 'json',
});

console.log(response.body); // Clean JSON object
```

### GraphQL APIs

**Pattern**: POST to `/graphql` with query in body

**Example**:
```graphql
POST https://example.com/graphql

{
  "query": "{ products(limit: 10) { id name price } }"
}
```

**How to use**:
```javascript
const response = await gotScraping({
    url: 'https://example.com/graphql',
    method: 'POST',
    json: {
        query: `{
            products(limit: 10) {
                id
                name
                price
                inStock
            }
        }`
    },
    responseType: 'json',
});

console.log(response.body.data.products);
```

### Paginated APIs

**Pattern**: `?page=1&limit=50` or cursor-based

**Example**:
```javascript
async function fetchAllProducts() {
    const allProducts = [];
    let page = 1;
    let hasMore = true;

    while (hasMore) {
        const response = await gotScraping({
            url: `https://api.shop.com/products?page=${page}&limit=50`,
            responseType: 'json',
        });

        allProducts.push(...response.body.products);

        hasMore = response.body.hasNextPage;
        page++;
    }

    return allProducts;
}
```
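
Cursor-based pagination follows the same loop shape. A sketch with the fetch function injected so the loop logic can be exercised without a live API — the field names `items` and `nextCursor` are assumptions; check the real response shape in DevTools:

```javascript
// Sketch: cursor-based pagination. `fetchPage(cursor)` should return
// { items: [...], nextCursor: string | null } — adjust to the real API.
async function fetchAllWithCursor(fetchPage) {
    const all = [];
    let cursor = null;
    do {
        const body = await fetchPage(cursor);
        all.push(...body.items);
        cursor = body.nextCursor ?? null;
    } while (cursor !== null);
    return all;
}
```

In production, `fetchPage` would wrap a `gotScraping` call that passes the cursor as a query parameter.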

## Authentication Handling

### Cookies

Extract from browser session:

```javascript
import { chromium } from 'playwright';
import { gotScraping } from 'got-scraping';

const browser = await chromium.launch();
const page = await browser.newPage();

// Navigate to site and let user login (or automate it)
await page.goto('https://example.com');

// Get cookies
const cookies = await page.context().cookies();

// Use in API requests
await gotScraping({
    url: 'https://api.example.com/data',
    headers: {
        'Cookie': cookies.map(c => `${c.name}=${c.value}`).join('; '),
    },
});
```

### Bearer Tokens

Extract from localStorage or API responses:

```javascript
// Get token from browser
const token = await page.evaluate(() => {
    return localStorage.getItem('auth_token');
});

// Use in API requests
await gotScraping({
    url: 'https://api.example.com/data',
    headers: {
        'Authorization': `Bearer ${token}`,
    },
});
```

### API Keys

Sometimes visible in Network tab headers:

```javascript
await gotScraping({
    url: 'https://api.example.com/data',
    headers: {
        'X-API-Key': 'abc123...',
        'X-Client-ID': 'web-app',
    },
});
```

## Hybrid Approach: Sitemap URLs + API Data

**Best of both worlds**: Use sitemap for URLs, API for data

```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// Get all product URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// Extract product IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

console.log(`Found ${productIds.length} products`);

// Fetch data from API (much faster than scraping pages)
for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);
    // Clean, structured data!
}
```

See `../examples/hybrid-sitemap-api.js` for complete example.

## got-scraping vs fetch

### Use `got-scraping` (Recommended)

**Benefits**:
- Automatic retries
- Browser-like headers
- Proxy support
- Cookie handling
- Response type conversion

```javascript
import { gotScraping } from 'got-scraping';

const response = await gotScraping({
    url: 'https://api.example.com/data',
    responseType: 'json', // Auto-parses JSON
    retry: {
        limit: 3,
    },
});

console.log(response.body); // Already parsed JSON
```

### Use `fetch` (Simple cases)

For simple requests:

```javascript
const response = await fetch('https://api.example.com/data');
const data = await response.json();
```

## Common API Patterns to Look For

### 1. Next.js Data

```
/_next/data/BUILD_ID/products/123.json
```

### 2. WordPress REST API

```
/wp-json/wp/v2/posts
/wp-json/wp/v2/pages
```

### 3. Shopify API

```
/products.json
/collections.json
/products/HANDLE.json
```
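
The Shopify `products.json` endpoint is paginated, so a small URL builder helps. A sketch — the 250-item cap is Shopify's commonly documented per-page maximum, but verify it against the target store:

```javascript
// Sketch: build a paginated Shopify products.json URL.
// `limit` is clamped to 250, the usual Shopify per-page cap (assumption).
function shopifyProductsUrl(origin, page = 1, limit = 250) {
    const u = new URL('/products.json', origin);
    u.searchParams.set('limit', String(Math.min(limit, 250)));
    u.searchParams.set('page', String(page));
    return u.href;
}
```

Loop `page` upward until a request returns an empty `products` array.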

### 4. Internal APIs

```
/api/v1/...
/internal/api/...
/_api/...
```

## Rate Limiting

Respect API rate limits:

```javascript
import { setTimeout } from 'timers/promises';

for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body);

    // Respect rate limits (e.g., 10 requests/second)
    await setTimeout(100); // 100ms delay
}
```

Better: Use Crawlee's built-in rate limiting:

```javascript
import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
    maxRequestsPerMinute: 60,
    async requestHandler({ json }) {
        console.log(json);
    },
});
```
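
If you are not using Crawlee, a small concurrency limiter gives similar control over how many API requests are in flight. A plain-JavaScript sketch, no library assumed:

```javascript
// Sketch: run `task` over `items` with at most `limit` in flight,
// preserving result order by index.
async function mapWithConcurrency(items, limit, task) {
    const results = new Array(items.length);
    let next = 0;
    async function worker() {
        while (next < items.length) {
            const i = next++; // claim the next index
            results[i] = await task(items[i], i);
        }
    }
    const workers = Math.min(limit, items.length);
    await Promise.all(Array.from({ length: workers }, worker));
    return results;
}
```

Combine this with a per-request delay inside `task` to stay under a requests-per-second budget.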

## Error Handling

Always handle API errors:

```javascript
try {
    const response = await gotScraping({
        url: `https://api.example.com/products/${id}`,
        responseType: 'json',
        timeout: {
            request: 10000, // 10 second timeout
        },
    });

    // Check for API-level errors
    if (response.body.error) {
        throw new Error(`API error: ${response.body.error}`);
    }

    return response.body;

} catch (error) {
    if (error.response?.statusCode === 404) {
        console.log(`Product ${id} not found`);
        return null;
    } else if (error.response?.statusCode === 429) {
        console.log('Rate limited, waiting...');
        await setTimeout(5000);
        // Retry
    } else {
        throw error;
    }
}
```

## Best Practices

### ✅ DO:

- **Always check for APIs first** before scraping HTML
- **Analyze Network tab** on every scraping project
- **Use got-scraping** for better reliability
- **Respect rate limits** (add delays or use Crawlee)
- **Handle authentication** properly (cookies, tokens)
- **Cache API responses** to avoid redundant requests
- **Log API calls** for debugging
- **Use TypeScript** for type-safe API responses

### ❌ DON'T:

- **Skip API discovery** - always check first!
- **Ignore rate limits** - you'll get blocked
- **Hardcode credentials** - use environment variables
- **Trust API responses** - validate data
- **Forget error handling** - APIs fail too
- **Make redundant requests** - cache when possible

## Complete Example: API-First Scraper

```javascript
import { gotScraping } from 'got-scraping';
import { setTimeout } from 'timers/promises';

async function scrapeProducts(productIds) {
    const results = [];

    for (const id of productIds) {
        try {
            console.log(`Fetching product ${id}...`);

            const response = await gotScraping({
                url: `https://api.shop.com/v1/products/${id}`,
                responseType: 'json',
                headers: {
                    'User-Agent': 'Mozilla/5.0...',
                },
                timeout: {
                    request: 10000,
                },
                retry: {
                    limit: 3,
                    methods: ['GET'],
                },
            });

            results.push(response.body);

            // Rate limiting (10 req/sec max)
            await setTimeout(100);

        } catch (error) {
            console.error(`Failed to fetch product ${id}:`, error.message);
        }
    }

    return results;
}

// Usage
const productIds = [123, 456, 789];
const products = await scrapeProducts(productIds);
console.log(`Scraped ${products.length} products`);
```

## Related Resources

- **Sitemap discovery**: See `sitemap-discovery.md` (get IDs from URLs)
- **Hybrid approach**: See `hybrid-approaches.md`
- **Examples**: See `../examples/api-scraper.js`
- **Examples**: See `../examples/hybrid-sitemap-api.js`

## Summary

**APIs are the BEST way to get data** - always look for them first!

**Key takeaways**:
1. Open DevTools Network tab before scraping
2. Look for `/api/`, `/v1/`, `/graphql` endpoints
3. Use got-scraping for reliability
4. Combine with sitemaps for complete coverage
5. Respect rate limits and authentication
6. 10-100x faster than HTML scraping!

```

### strategies/playwright-scraping.md

```markdown
# Playwright-Based Scraping

## Overview

Use Playwright when websites require JavaScript rendering, user interactions, or when APIs and sitemaps aren't available. Playwright provides a real browser environment for complex scraping scenarios.

## When to Use Playwright

### ✅ USE Playwright when:
- Site requires JavaScript rendering (React, Vue, Angular, etc.)
- Need to interact with page elements (clicks, forms, scrolling)
- Content loads dynamically (AJAX, infinite scroll)
- No sitemap or API available
- Authentication flows required (login, cookies)
- Need to capture screenshots or PDFs
- Single-page applications (SPAs)

### ❌ DON'T use Playwright when:
- Site has an API (use API instead - 10x faster)
- Static HTML works fine (use Cheerio - 5x faster)
- Simple GET requests sufficient
- Site has sitemaps (combine with sitemap for URLs)
- High-volume scraping (resource intensive)

## Selector Strategies (Priority Order)

Always use selectors in this priority order (most stable → least stable):

### 1. Role-Based Selectors (Most Stable)

```javascript
// Get by role and name
await page.getByRole('button', { name: 'Add to cart' }).click();
await page.getByRole('heading', { level: 1 }).textContent();
await page.getByRole('link', { name: 'Next page' }).click();

// Common roles
page.getByRole('button')
page.getByRole('link')
page.getByRole('textbox')
page.getByRole('checkbox')
page.getByRole('heading')
page.getByRole('list')
page.getByRole('listitem')
```

**Why**: Based on semantic HTML, survives CSS/class name changes.

### 2. Test IDs (Developer-Friendly)

```javascript
await page.getByTestId('product-price').textContent();
await page.getByTestId('add-to-cart-button').click();
```

**Why**: Designed for testing, stable across refactors.

### 3. Labels (Form Elements)

```javascript
await page.getByLabel('Email').fill('[email protected]');
await page.getByLabel('Password').fill('password123');
await page.getByLabel('Remember me').check();
```

**Why**: Accessible, user-centric, stable.

### 4. Text Content

```javascript
await page.getByText('Sign in').click();
await page.getByText('Add to cart').click();
```

**Why**: Works when text is stable, intuitive.

### 5. CSS/XPath (Last Resort)

```javascript
// CSS selectors
const price = await page.locator('.product-price').textContent();
const title = await page.locator('h1.title').textContent();

// XPath
const element = await page.locator('xpath=//div[@class="content"]');
```

**Why**: Fragile, breaks when HTML structure changes. Use only when nothing else works.

## Auto-Waiting (Never Use setTimeout!)

Playwright automatically waits for elements. **Never use arbitrary timeouts**.

### ❌ BAD - Arbitrary Waits

```javascript
await page.waitForTimeout(5000); // DON'T DO THIS!
await page.waitForTimeout(3000); // NEVER!
```

### ✅ GOOD - Wait for Specific Conditions

```javascript
// Wait for element to be visible
await page.waitForSelector('.product-loaded');

// Wait for network to be idle
await page.waitForLoadState('networkidle');

// Wait for specific URL
await page.waitForURL('**/products/**');

// Wait with assertions (best)
await expect(page.getByRole('heading')).toBeVisible();
```

### ✅ BETTER - Implicit Waiting

Playwright actions automatically wait:

```javascript
// Automatically waits for element to be:
// - Attached to DOM
// - Visible
// - Stable (not animating)
// - Enabled
await page.getByRole('button').click();

// Automatically waits for element
const title = await page.getByRole('heading').textContent();
```

## Basic Scraping Pattern

```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        log.info(`Scraping: ${request.url}`);

        // Navigate if needed
        await page.goto(request.url, {
            waitUntil: 'domcontentloaded',
        });

        // Wait for content
        await page.waitForSelector('.product-info');

        // Extract data
        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent?.trim(),
            price: document.querySelector('.price')?.textContent?.trim(),
            description: document.querySelector('.description')?.textContent?.trim(),
            images: Array.from(document.querySelectorAll('.product-image'))
                .map(img => img.src),
            inStock: document.querySelector('.in-stock') !== null,
        }));

        // Save data
        await Dataset.pushData({
            url: request.url,
            ...data,
            scrapedAt: new Date().toISOString(),
        });
    },
});

await crawler.run(['https://example.com/product/123']);
```

## Common Patterns

### Pattern 1: Extract with page.evaluate()

```javascript
const data = await page.evaluate(() => {
    return {
        title: document.title,
        price: document.querySelector('.price')?.textContent,
        // Extract multiple items
        products: Array.from(document.querySelectorAll('.product')).map(el => ({
            name: el.querySelector('.name')?.textContent,
            price: el.querySelector('.price')?.textContent,
        })),
    };
});
```

### Pattern 2: Extract with Playwright Locators

```javascript
const title = await page.locator('h1').textContent();
const price = await page.locator('.price').textContent();

// Extract multiple elements
const productNames = await page.locator('.product-name').allTextContents();
```

### Pattern 3: Handle Dynamic Content

```javascript
// Infinite scroll
async function scrollToBottom(page) {
    let previousHeight = 0;
    let currentHeight = await page.evaluate(() => document.body.scrollHeight);

    while (previousHeight !== currentHeight) {
        previousHeight = currentHeight;
        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(1000); // Pragmatic exception: no stable DOM signal to wait on after a scroll
        currentHeight = await page.evaluate(() => document.body.scrollHeight);
    }
}

await scrollToBottom(page);
// Now extract all loaded content
```

### Pattern 4: Click "Load More" Buttons

```javascript
while (true) {
    const loadMore = await page.locator('button:has-text("Load More")').count();

    if (loadMore === 0) {
        break; // No more button
    }

    await page.getByRole('button', { name: 'Load More' }).click();
    await page.waitForLoadState('networkidle');
}

// Now extract all loaded products
```

### Pattern 5: Handle Pagination

```javascript
let currentPage = 1;
const maxPages = 10;

while (currentPage <= maxPages) {
    console.log(`Scraping page ${currentPage}...`);

    // Extract data from current page
    const products = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('.product')).map(el => ({
            name: el.querySelector('.name')?.textContent,
            price: el.querySelector('.price')?.textContent,
        }));
    });

    await Dataset.pushData(products);

    // Check if next page exists
    const nextButton = await page.locator('a.next-page').count();
    if (nextButton === 0) {
        break;
    }

    // Go to next page
    await page.getByRole('link', { name: 'Next' }).click();
    await page.waitForLoadState('networkidle');

    currentPage++;
}
```

## Authentication

### Pattern 1: Login Flow

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        // Navigate to login page
        await page.goto('https://example.com/login');

        // Fill login form
        await page.getByLabel('Email').fill('[email protected]');
        await page.getByLabel('Password').fill('password123');
        await page.getByRole('button', { name: 'Sign in' }).click();

        // Wait for redirect
        await page.waitForURL('**/dashboard');

        // Now navigate to target page
        await page.goto(request.url);

        // Extract data
        // ...
    },
});
```

### Pattern 2: Reuse Authenticated Session

```javascript
// Save session after first login
const context = await browser.newContext();
const page = await context.newPage();

// Login once
await page.goto('https://example.com/login');
// ... login flow ...

// Save cookies/localStorage
const cookies = await context.cookies();
const storageJson = await page.evaluate(() => JSON.stringify(localStorage));

// Reuse in new sessions (origins.localStorage must be an array of { name, value } pairs)
const newContext = await browser.newContext({
    storageState: {
        cookies,
        origins: [{
            origin: 'https://example.com',
            localStorage: Object.entries(JSON.parse(storageJson))
                .map(([name, value]) => ({ name, value })),
        }],
    },
});
```

## Error Handling

```javascript
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        try {
            await page.goto(request.url, {
                waitUntil: 'domcontentloaded',
                timeout: 30000,
            });

            // Verify page loaded
            const isLoaded = await page.locator('body').isVisible();
            if (!isLoaded) {
                throw new Error('Page did not load properly');
            }

            // Extract data
            const data = await page.evaluate(() => {
                return {
                    title: document.title,
                };
            });

            await Dataset.pushData(data);

        } catch (error) {
            log.error(`Failed to scrape ${request.url}: ${error.message}`);
            // Don't throw - let Crawlee handle retry
        }
    },

    failedRequestHandler({ request, log }, error) {
        log.error(`Request failed after retries: ${request.url} (${error.message})`);
    },

    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 60,
});
```

## Performance Optimization

### 1. Block Unnecessary Resources

```javascript
const crawler = new PlaywrightCrawler({
    preNavigationHooks: [async ({ page, request }) => {
        // Block images, fonts, etc.
        await page.route('**/*', (route) => {
            const resourceType = route.request().resourceType();
            if (['image', 'font', 'media'].includes(resourceType)) {
                route.abort();
            } else {
                route.continue();
            }
        });
    }],
    // ...
});
```

### 2. Use Headless Mode

```javascript
const crawler = new PlaywrightCrawler({
    headless: true, // Faster than headed mode
    // ...
});
```

### 3. Control Concurrency

```javascript
const crawler = new PlaywrightCrawler({
    maxConcurrency: 5, // Run 5 browsers in parallel
    maxRequestsPerMinute: 60, // Rate limiting
    // ...
});
```

## Best Practices

### ✅ DO:

- **Use role-based selectors** first (most stable)
- **Let Playwright auto-wait** (never use setTimeout)
- **Extract data with page.evaluate()** for complex queries
- **Handle errors gracefully** (try/catch, failedRequestHandler)
- **Block unnecessary resources** (images, fonts)
- **Use headless mode** for production
- **Respect rate limits** (maxRequestsPerMinute)
- **Log progress clearly** for debugging

### ❌ DON'T:

- **Use arbitrary timeouts** (`waitForTimeout(5000)`)
- **Use fragile CSS selectors** as first choice
- **Forget error handling** - pages fail!
- **Run too many concurrent browsers** - memory intensive
- **Scrape if API exists** - use API instead (10x faster)
- **Forget to close browsers** - memory leaks

## Related Resources

- **Sitemap discovery**: See `sitemap-discovery.md` (get URLs first)
- **API discovery**: See `api-discovery.md` (prefer APIs!)
- **Hybrid approach**: See `hybrid-approaches.md`
- **Selectors**: See `../reference/selector-guide.md`
- **Examples**: See `../examples/playwright-basic.js`

## Summary

**Playwright is powerful but resource-intensive** - use when necessary!

**Key takeaways**:
1. Check for APIs first (10x faster than Playwright)
2. Use role-based selectors (most stable)
3. Let Playwright auto-wait (no setTimeout)
4. Handle errors and retries properly
5. Block unnecessary resources for speed
6. Combine with sitemaps for complete coverage

```

### strategies/cheerio-scraping.md

```markdown
# Cheerio (HTTP-Only) Scraping

## Overview

Cheerio is a fast, lightweight library for parsing HTML using jQuery-like syntax. It's perfect for static HTML sites that don't require JavaScript rendering.

## When to Use Cheerio

### ✅ USE Cheerio when:
- Website serves static HTML (server-side rendered)
- No JavaScript rendering needed
- Simple HTML structure
- High-volume scraping (5x faster than Playwright)
- Low memory requirements
- API doesn't exist but HTML is simple

### ❌ DON'T use Cheerio when:
- Site requires JavaScript (React, Vue, Angular)
- Content loads dynamically via AJAX
- Need to interact with page (clicks, forms)
- Need to execute JavaScript
- Single-page application (SPA)

## Quick Example

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, log }) {
        log.info(`Scraping: ${request.url}`);

        // Use jQuery-like selectors
        const title = $('h1').text().trim();
        const price = $('.price').text().trim();
        const description = $('.description').text().trim();

        // Extract multiple items
        const products = [];
        $('.product').each((index, element) => {
            products.push({
                name: $(element).find('.name').text(),
                price: $(element).find('.price').text(),
            });
        });

        await Dataset.pushData({
            title,
            price,
            description,
            products,
        });
    },
});

await crawler.run(['https://example.com']);
```

## jQuery Selectors

Cheerio uses the same selector syntax as jQuery:

```javascript
// Basic selectors
$('h1')                          // Tag
$('.class-name')                 // Class
$('#id')                         // ID
$('div.product')                 // Tag + class
$('a[href]')                     // Attribute exists
$('a[href="https://..."]')       // Attribute value

// Hierarchy
$('div > p')                     // Direct child
$('div p')                       // Descendant
$('div + p')                     // Next sibling

// Traversal
$('h1').parent()                 // Parent element
$('div').find('.price')          // Find descendant
$('li').first()                  // First element
$('li').last()                   // Last element
$('li').eq(2)                    // Element at index

// Extraction
$('h1').text()                   // Text content
$('img').attr('src')             // Attribute value
$('div').html()                  // Inner HTML
```

## Common Patterns

### Pattern 1: Extract Single Values

```javascript
async requestHandler({ $, request }) {
    const data = {
        title: $('h1.title').text().trim(),
        price: $('.price').first().text().trim(),
        image: $('img.main-image').attr('src'),
        description: $('.description').text().trim(),
        rating: $('[data-rating]').attr('data-rating'),
    };

    await Dataset.pushData(data);
}
```

### Pattern 2: Extract Lists

```javascript
async requestHandler({ $, request }) {
    const products = [];

    $('.product-item').each((index, element) => {
        const $el = $(element);

        products.push({
            name: $el.find('.name').text().trim(),
            price: $el.find('.price').text().trim(),
            url: $el.find('a').attr('href'),
            image: $el.find('img').attr('src'),
        });
    });

    await Dataset.pushData(products);
}
```
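
Prices extracted this way arrive as strings like `"$1,299.00"`. A normalization sketch — the separator heuristics are assumptions and will misread some locales, so verify against the target site:

```javascript
// Sketch: normalize a scraped price string to a number, or null.
function parsePrice(text) {
    if (!text) return null;
    let s = text.replace(/[^\d.,]/g, ''); // drop currency symbols and spaces
    if (s.includes(',') && s.includes('.')) {
        s = s.replace(/,/g, ''); // assume "," is a thousands separator: "1,299.00"
    } else if (s.includes(',')) {
        s = s.replace(',', '.'); // assume "," is the decimal separator: "12,99"
    }
    const n = parseFloat(s);
    return Number.isNaN(n) ? null : n;
}
```

Storing the parsed number alongside the raw string makes it easy to audit bad parses later.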

### Pattern 3: Follow Links (Crawling)

```javascript
const crawler = new CheerioCrawler({
    async requestHandler({ $, request, enqueueLinks }) {
        // Extract data from current page
        const products = [];
        $('.product').each((i, el) => {
            products.push({
                name: $(el).find('.name').text(),
                price: $(el).find('.price').text(),
            });
        });

        await Dataset.pushData(products);

        // Enqueue links to other pages
        await enqueueLinks({
            selector: 'a.product-link',
            strategy: 'same-domain',
        });
    },
    maxRequestsPerCrawl: 100,
});

await crawler.run(['https://example.com']);
```

## Performance Comparison

| Metric | Cheerio | Playwright | Difference |
|--------|---------|-----------|------------|
| **Speed** | Very fast | Slow | 5-10x faster |
| **Memory** | Low (~50 MB) | High (~500 MB) | 10x less |
| **CPU** | Low | High | 5-10x less |
| **Concurrency** | High (50+) | Low (5-10) | Can run more in parallel |

**When scraping 1000 pages**:
- Cheerio: 5-10 minutes
- Playwright: 30-60 minutes

## Best Practices

### ✅ DO:

- **Use for static HTML sites**
- **High concurrency** (30-50 parallel requests)
- **Chain selectors** for complex queries
- **Trim text content** (`.text().trim()`)
- **Handle missing elements** (`?.`)
- **Combine with sitemaps** for URL discovery

### ❌ DON'T:

- **Use for JavaScript-heavy sites** (use Playwright)
- **Expect JavaScript execution**
- **Forget to handle missing elements**
- **Skip rate limiting** (respect robots.txt)

## Complete Example

```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 30, // High concurrency for Cheerio
    maxRequestsPerMinute: 120,

    async requestHandler({ $, request, log }) {
        log.info(`Scraping: ${request.url}`);

        try {
            const data = {
                url: request.url,
                title: $('h1').text().trim(),
                price: $('.price').first().text().trim(),
                images: $('img.product-image')
                    .map((i, el) => $(el).attr('src'))
                    .get(),
                specs: {},
            };

            // Extract specifications
            $('.spec-row').each((i, el) => {
                const key = $(el).find('.spec-name').text().trim();
                const value = $(el).find('.spec-value').text().trim();
                data.specs[key] = value;
            });

            await Dataset.pushData(data);

        } catch (error) {
            log.error(`Error scraping ${request.url}: ${error.message}`);
        }
    },

    failedRequestHandler({ request, log }, error) {
        log.error(`Request failed: ${request.url} (${error.message})`);
    },
});

await crawler.run(['https://example.com']);
```

## When to Use Playwright Instead

Switch to Playwright if you see:
- Empty content (JavaScript-rendered)
- Infinite scroll
- "Load More" buttons
- Content appears after delay
- React/Vue/Angular indicators

## Related Resources

- **For JavaScript sites**: See `playwright-scraping.md`
- **For APIs**: See `api-discovery.md`
- **Selectors**: See `../reference/selector-guide.md`

## Summary

**Cheerio is 5-10x faster than Playwright** for static HTML!

**Key takeaways**:
1. Use for static HTML sites only
2. 5-10x faster than Playwright
3. High concurrency possible (30-50 parallel)
4. jQuery-like syntax (easy to use)
5. Fall back to Playwright for JavaScript-heavy sites

```

### strategies/hybrid-approaches.md

```markdown
# Hybrid Scraping Approaches

## Overview

Combine multiple strategies for optimal speed, reliability, and data quality. Hybrid approaches leverage the strengths of each method.

## Common Hybrid Patterns

### Pattern 1: Sitemap + API (Best Performance)

**Use case**: Site has sitemap + hidden API

**Advantages**:
- Instant URL discovery (sitemap)
- Clean structured data (API)
- 60x faster than crawling + scraping
- Most reliable data format

**Example**:
```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';

// 1. Get all URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();

// 2. Extract IDs from URLs
const productIds = urls
    .map(url => url.match(/\/products\/(\d+)/)?.[1])
    .filter(Boolean);

console.log(`Found ${productIds.length} products`);

// 3. Fetch data via API (clean JSON)
for (const id of productIds) {
    const response = await gotScraping({
        url: `https://api.shop.com/v1/products/${id}`,
        responseType: 'json',
    });

    console.log(response.body); // Clean, structured data
}
```

**Performance**:
- URL discovery: 5-10 seconds (sitemap)
- Data fetching: 2-5 minutes (API)
- Total: ~5 minutes for 1000 products
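The per-ID loop above fetches sequentially; for large catalogs it is common to fetch a few IDs at a time with `Promise.all`. A minimal chunking helper for that (a sketch; the batch size and names are arbitrary):

```javascript
// Split IDs into batches so Promise.all() can fetch a few at a time
// without hammering the API.
function chunk(items, size) {
    const batches = [];
    for (let i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Usage with the loop above (fetchProduct() wraps the gotScraping call):
// for (const batch of chunk(productIds, 10)) {
//     await Promise.all(batch.map((id) => fetchProduct(id)));
// }
```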

### Pattern 2: Sitemap + Playwright

**Use case**: Site has sitemap but no API

**Advantages**:
- Fast URL discovery (sitemap)
- Can handle JavaScript (Playwright)
- No need to crawl for URLs

**Example**:
```javascript
import { PlaywrightCrawler, RobotsFile, Dataset } from 'crawlee';

// 1. Get URLs from sitemap
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();

// 2. Scrape pages with Playwright
const crawler = new PlaywrightCrawler({
    maxConcurrency: 5,

    async requestHandler({ page, request, log }) {
        log.info(`Scraping: ${request.url}`);

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            price: document.querySelector('.price')?.textContent,
        }));

        await Dataset.pushData({ url: request.url, ...data });
    },
});

await crawler.addRequests(urls);
await crawler.run();
```

**Performance**:
- URL discovery: 5-10 seconds
- Scraping: 10-20 minutes (for 1000 pages)
- Total: ~20 minutes

### Pattern 3: Iterative Fallback

**Use case**: Unknown site, try simplest first

**Advantages**:
- Start with fastest approach
- Automatically fallback if fails
- Optimal for unknown sites

**Example**:
```javascript
async function scrapeWithFallback(url) {
    // Try 1: Sitemap + API
    try {
        console.log('Attempting: Sitemap + API...');
        const robots = await RobotsFile.find(url);
        const urls = await robots.parseUrlsFromSitemaps();

        if (urls.length > 0) {
            // Check for API
            const apiUrl = await discoverAPI(url);
            if (apiUrl) {
                console.log('✓ Using Sitemap + API (fastest)');
                return await scrapeSitemapAPI(urls, apiUrl);
            }
        }
    } catch (error) {
        console.log('✗ Sitemap + API failed');
    }

    // Try 2: Sitemap + Playwright
    try {
        console.log('Attempting: Sitemap + Playwright...');
        const robots = await RobotsFile.find(url);
        const urls = await robots.parseUrlsFromSitemaps();

        if (urls.length > 0) {
            console.log('✓ Using Sitemap + Playwright');
            return await scrapeSitemapPlaywright(urls);
        }
    } catch (error) {
        console.log('✗ Sitemap + Playwright failed');
    }

    // Try 3: Pure Playwright crawling
    try {
        console.log('Attempting: Playwright crawling...');
        console.log('✓ Using Playwright crawling (fallback)');
        return await scrapePlaywrightCrawl(url);
    } catch (error) {
        console.log('✗ All methods failed');
        throw error;
    }
}
```
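The `discoverAPI()` helper above is left undefined; one hedged sketch is to build a list of common endpoint locations and probe each with a short-timeout request (the candidate paths below are guesses, not a standard):

```javascript
// Build candidate API base URLs to probe. Actual probing would use
// gotScraping with a short timeout, as in the other examples.
function apiCandidates(baseUrl) {
    const { hostname, origin } = new URL(baseUrl);
    return [
        `https://api.${hostname}/v1`,   // common api.* subdomain
        `${origin}/api`,                // common /api path
        `${origin}/wp-json/wp/v2`,      // WordPress REST API
    ];
}
```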

### Pattern 4: API + Playwright Fallback

**Use case**: API for most data, Playwright for missing fields

**Advantages**:
- Fast API for core data
- Playwright for complex fields (reviews, dynamic content)
- Best data quality

**Example**:
```javascript
import { gotScraping } from 'got-scraping';
import { chromium } from 'playwright';

async function scrapeProduct(productId) {
    // 1. Get core data from API (fast)
    const apiData = await gotScraping({
        url: `https://api.shop.com/products/${productId}`,
        responseType: 'json',
    });

    // 2. Get complex data with Playwright
    const browser = await chromium.launch();
    const page = await browser.newPage();

    await page.goto(`https://shop.com/products/${productId}`);

    // Scrape reviews (dynamic content)
    const reviews = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('.review')).map(el => ({
            rating: el.querySelector('.rating')?.textContent,
            text: el.querySelector('.text')?.textContent,
        }));
    });

    await browser.close();

    // 3. Combine data
    return {
        ...apiData.body,
        reviews,
    };
}
```

### Pattern 5: Cheerio + Playwright Hybrid

**Use case**: Most pages static, some dynamic

**Advantages**:
- Fast Cheerio for static pages
- Playwright only when needed
- Optimal resource usage

**Example**:
```javascript
import { CheerioCrawler, PlaywrightCrawler, Dataset } from 'crawlee';

async function scrapeHybrid(urls) {
    // Try Cheerio first (fast)
    const cheerioCrawler = new CheerioCrawler({
        maxConcurrency: 30,
        async requestHandler({ $, request }) {
            // Check if content is present
            const title = $('h1').text();

            if (!title) {
                // Content missing (JavaScript-rendered)
                console.log(`Cheerio failed for ${request.url}, using Playwright...`);
                await playwrightCrawler.addRequests([request.url]);
                return;
            }

            // Extract with Cheerio (fast)
            await Dataset.pushData({
                url: request.url,
                title: title,
                price: $('.price').text(),
            });
        },
    });

    // Playwright for JavaScript pages
    const playwrightCrawler = new PlaywrightCrawler({
        async requestHandler({ page, request }) {
            const data = await page.evaluate(() => ({
                title: document.querySelector('h1')?.textContent,
                price: document.querySelector('.price')?.textContent,
            }));

            await Dataset.pushData({ url: request.url, ...data });
        },
    });

    await cheerioCrawler.run(urls);

    // Drain the queue of JavaScript-rendered pages collected above
    await playwrightCrawler.run();
}
```

## Decision Matrix

| Scenario | Best Approach | Speed | Data Quality |
|----------|---------------|-------|--------------|
| Sitemap + API exist | Sitemap + API | ⚡⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ |
| Sitemap + No API + Static | Sitemap + Cheerio | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ |
| Sitemap + No API + Dynamic | Sitemap + Playwright | ⚡⚡⚡ | ⭐⭐⭐⭐ |
| No Sitemap + API | API Discovery | ⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ |
| Unknown Site | Iterative Fallback | ⚡⚡⚡ | ⭐⭐⭐⭐ |
| Mixed Static/Dynamic | Cheerio + Playwright | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ |
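For scripts that choose a path automatically, the matrix can be encoded as a small lookup (the labels are this document's, not a library API):

```javascript
// Encode the decision matrix above: prefer sitemap+API, then APIs,
// then sitemap-fed scraping, and fall back to iterative discovery.
function pickStrategy({ hasSitemap, hasApi, isDynamic }) {
    if (hasSitemap && hasApi) return 'sitemap+api';
    if (hasApi) return 'api-discovery';
    if (hasSitemap) return isDynamic ? 'sitemap+playwright' : 'sitemap+cheerio';
    return 'iterative-fallback';
}
```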

## Best Practices

### ✅ DO:

- **Start with simplest approach** (sitemap/API)
- **Fallback to complex methods** if simple fails
- **Test small batch first** (5-10 items)
- **Log which method succeeded** for debugging
- **Combine strengths** (sitemap URLs + API data)
- **Use Cheerio for static content** (5x faster)
- **Reserve Playwright for when needed** (resource-intensive)

### ❌ DON'T:

- **Use Playwright if Cheerio works** (waste of resources)
- **Skip API discovery** (always check first!)
- **Forget fallback strategies** (sites change)
- **Mix approaches randomly** (be systematic)

## Complete Example: Full Hybrid

```javascript
import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';
import { chromium } from 'playwright';

async function scrapeWebsite(baseUrl) {
    console.log('🔍 Phase 1: Discovery');

    // Check for sitemap
    const robots = await RobotsFile.find(baseUrl);
    const sitemapUrls = await robots.parseUrlsFromSitemaps();

    console.log(`Found ${sitemapUrls.length} URLs in sitemap`);

    // Extract product IDs
    const productIds = sitemapUrls
        .map(url => url.match(/\/products\/(\d+)/)?.[1])
        .filter(Boolean);

    console.log(`Extracted ${productIds.length} product IDs`);

    console.log('🔍 Phase 2: API Discovery');

    // Try API first
    try {
        const testId = productIds[0];
        const apiResponse = await gotScraping({
            url: `https://api.${baseUrl.replace('https://', '')}/products/${testId}`,
            responseType: 'json',
            timeout: { request: 5000 },
        });

        console.log('✓ API found! Using API for data');

        // Use API for all products
        const results = [];
        for (const id of productIds) {
            const data = await gotScraping({
                url: `https://api.${baseUrl.replace('https://', '')}/products/${id}`,
                responseType: 'json',
            });
            results.push(data.body);
        }

        return results;

    } catch (error) {
        console.log('✗ No API found, using Playwright');
    }

    console.log('🔍 Phase 3: Playwright Scraping');

    // Fallback to Playwright
    const browser = await chromium.launch();
    const results = [];

    for (const url of sitemapUrls.slice(0, 10)) { // Test with 10 first
        const page = await browser.newPage();
        await page.goto(url);

        const data = await page.evaluate(() => ({
            title: document.querySelector('h1')?.textContent,
            price: document.querySelector('.price')?.textContent,
        }));

        results.push({ url, ...data });
        await page.close();
    }

    await browser.close();
    return results;
}

// Usage
const data = await scrapeWebsite('https://example.com');
console.log(`Scraped ${data.length} products`);
```

## Related Resources

- **Sitemap**: See `sitemap-discovery.md`
- **API**: See `api-discovery.md`
- **Playwright**: See `playwright-scraping.md`
- **Cheerio**: See `cheerio-scraping.md`
- **Examples**: See `../examples/hybrid-sitemap-api.js`
- **Examples**: See `../examples/iterative-fallback.js`

## Summary

**Hybrid approaches combine the best of each method!**

**Key takeaways**:
1. Sitemap + API = fastest (60x faster than crawling)
2. Start simple, fallback to complex
3. Test small batch first
4. Log which method succeeded
5. Combine strengths for optimal results

```

### reference/fingerprint-patterns.md

```markdown
# Fingerprint Configuration Patterns

Quick reference for common fingerprint-suite configurations.

## Basic Patterns

### Desktop Chrome (Windows/Mac) - Most Common

```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos'],
    browsers: ['chrome'],
}
```

**Use for**: Most websites, general scraping

### Mobile iPhone (iOS Safari)

```typescript
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['ios'],
    browsers: ['safari'],
}
```

**Use for**: Mobile-specific sites, iOS apps' web views

### Mobile Android (Chrome)

```typescript
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['android'],
    browsers: ['chrome'],
}
```

**Use for**: Mobile sites, Android apps' web views

### Desktop Firefox (Linux)

```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['linux'],
    browsers: ['firefox'],
}
```

**Use for**: Sites blocking Chrome, developer-focused sites

## Advanced Patterns

### Random Desktop Browser

```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos', 'linux'],
    browsers: ['chrome', 'firefox', 'edge'],
}
```

**Use for**: Maximum variety, avoiding pattern detection

### Windows Only (Corporate Environment)

```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows'],
    browsers: ['chrome', 'edge'],
}
```

**Use for**: Business/enterprise sites

### Mobile Only (Both Platforms)

```typescript
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['ios', 'android'],
}
```

**Use for**: Mobile-first websites

## Proxy + Fingerprint Combinations

### Residential Proxy + Desktop

```typescript
const crawler = new PlaywrightCrawler({
    fingerprintOptions: {
        devices: ['desktop'],
        operatingSystems: ['windows', 'macos'],
        browsers: ['chrome'],
    },
    proxyConfiguration: await Actor.createProxyConfiguration({
        groups: ['RESIDENTIAL'],
    }),
});
```

**Best for**: Strict anti-bot sites (e-commerce, social media)

### Datacenter Proxy + Mobile

```typescript
const crawler = new PlaywrightCrawler({
    fingerprintOptions: {
        devices: ['mobile'],
        operatingSystems: ['android'],
    },
    proxyConfiguration: await Actor.createProxyConfiguration({
        groups: ['SHADER'],
    }),
});
```

**Best for**: Mobile scraping with moderate protection

### No Proxy + Fingerprint

```typescript
const crawler = new PlaywrightCrawler({
    fingerprintOptions: {
        devices: ['desktop'],
        operatingSystems: ['windows'],
        browsers: ['chrome'],
    },
    // No proxy - use local IP
});
```

**Best for**: Sites with light protection, testing

## Session Configuration Patterns

### Aggressive Rotation (Very Strict Sites)

```typescript
useSessionPool: true,
sessionPoolOptions: {
    maxPoolSize: 100,
    sessionOptions: {
        maxUsageCount: 5,      // Rotate after just 5 requests
        maxErrorScore: 1,       // Retire after single error
    },
},
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos'],
    browsers: ['chrome'],
}
```

### Balanced Rotation (Normal Sites)

```typescript
useSessionPool: true,
sessionPoolOptions: {
    maxPoolSize: 50,
    sessionOptions: {
        maxUsageCount: 30,     // 30 requests per session
        maxErrorScore: 3,       // 3 errors allowed
    },
},
fingerprintOptions: {
    devices: ['desktop'],
    browsers: ['chrome'],
}
```

### Minimal Rotation (Light Protection)

```typescript
useSessionPool: true,
sessionPoolOptions: {
    maxPoolSize: 10,
    sessionOptions: {
        maxUsageCount: 100,    // Many requests per session
        maxErrorScore: 5,       // Tolerate more errors
    },
},
fingerprintOptions: {
    devices: ['desktop'],
}
```

## Common Use Cases

### E-commerce Scraping

```typescript
fingerprintOptions: {
    devices: ['desktop', 'mobile'],  // Both devices
    operatingSystems: ['windows', 'macos', 'ios', 'android'],
    browsers: ['chrome', 'safari'],
}
proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],  // Residential proxies
    countryCode: 'US',        // Target country
})
```

### Social Media Scraping

```typescript
fingerprintOptions: {
    devices: ['mobile'],              // Mobile-first
    operatingSystems: ['ios', 'android'],
    browsers: ['safari', 'chrome'],
}
proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
})
sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 10,            // Rotate frequently
    },
}
```

### News/Content Sites

```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos'],
    browsers: ['chrome', 'firefox'],
}
proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['SHADER'],               // Datacenter OK
})
```

### Travel/Booking Sites

```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos'],
    browsers: ['chrome'],
}
proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',                // Match target market
})
sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 20,
        maxAgeSecs: 3600,             // 1-hour session lifetime
    },
}
```

## Testing Patterns

### Test Your Fingerprint

```typescript
// Visit bot detection sites
const testUrls = [
    'https://bot.sannysoft.com/',
    'https://browserleaks.com/canvas',
    'https://www.whatismybrowser.com/',
];

const crawler = new PlaywrightCrawler({
    fingerprintOptions: {
        devices: ['desktop'],
        browsers: ['chrome'],
    },
    async requestHandler({ page, request, log }) {
        await page.screenshot({ path: `test-${Date.now()}.png` });
        log.info(`Tested: ${request.url}`);
    },
});

await crawler.run(testUrls);
```

## Troubleshooting Patterns

### Pattern 1: Getting Blocked Despite Fingerprints

**Try escalating**:

```typescript
// From this (basic)
fingerprintOptions: {
    devices: ['desktop'],
}

// To this (specific + proxy)
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows'],
    browsers: ['chrome'],
}
proxyConfiguration: await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
})
```

### Pattern 2: Mismatched Fingerprint

**Wrong** (iOS fingerprint with Chromium):
```typescript
// Using chromium.launch()
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['ios'],  // ❌ Wrong - iOS uses Safari/WebKit
    browsers: ['safari'],
}
```

**Right**:
```typescript
// Using chromium.launch()
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['android'],  // ✅ Right - Android uses Chrome
    browsers: ['chrome'],
}
```
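This mismatch can be caught before a run with a quick sanity check (an illustrative helper, not part of fingerprint-suite):

```javascript
// Returns true when the fingerprint options are plausible for the
// browser binary actually being launched.
function fingerprintMatchesLauncher(launcher, options = {}) {
    const os = options.operatingSystems ?? [];
    const browsers = options.browsers ?? [];
    if (launcher === 'chromium') {
        // Chromium cannot convincingly present as iOS Safari
        if (os.includes('ios') || browsers.includes('safari')) return false;
    }
    if (launcher === 'firefox' && browsers.some((b) => b !== 'firefox')) return false;
    return true;
}
```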

### Pattern 3: Too Many Variables

**Inefficient** (random everything):
```typescript
fingerprintOptions: {
    devices: ['desktop', 'mobile'],
    operatingSystems: ['windows', 'macos', 'linux', 'ios', 'android'],
    browsers: ['chrome', 'firefox', 'safari', 'edge'],
}
```

**Better** (focused):
```typescript
fingerprintOptions: {
    devices: ['desktop'],
    operatingSystems: ['windows', 'macos'],
    browsers: ['chrome'],
}
```

## Quick Reference Table

| Use Case | Device | OS | Browser | Proxy |
|----------|--------|----|---------| ------|
| General scraping | desktop | windows, macos | chrome | SHADER |
| E-commerce | desktop + mobile | all | chrome, safari | RESIDENTIAL |
| Social media | mobile | ios, android | safari, chrome | RESIDENTIAL |
| News sites | desktop | windows, macos | chrome, firefox | SHADER |
| Booking sites | desktop | windows, macos | chrome | RESIDENTIAL + country |
| Mobile apps | mobile | ios or android | safari or chrome | RESIDENTIAL |
| Testing | desktop | windows | chrome | none |

## Copy-Paste Snippets

### Snippet 1: Full Crawlee Setup

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { Actor } from 'apify';

const crawler = new PlaywrightCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 50,
        sessionOptions: { maxUsageCount: 30 },
    },
    fingerprintOptions: {
        devices: ['desktop'],
        operatingSystems: ['windows', 'macos'],
        browsers: ['chrome'],
    },
    proxyConfiguration: await Actor.createProxyConfiguration({
        groups: ['RESIDENTIAL'],
    }),
    async requestHandler({ page }) {
        // Your code here
    },
});
```

### Snippet 2: Playwright Only

```typescript
import { chromium } from 'playwright';
import { newInjectedContext } from 'fingerprint-injector';

const browser = await chromium.launch();
const context = await newInjectedContext(browser, {
    fingerprintOptions: {
        devices: ['desktop'],
        browsers: ['chrome'],
    },
});
const page = await context.newPage();
```

### Snippet 3: Mobile Specific

```typescript
fingerprintOptions: {
    devices: ['mobile'],
    operatingSystems: ['ios'],
    browsers: ['safari'],
}
```

## Related

- **Complete guide**: See `../strategies/anti-blocking.md`
- **Anti-patterns**: See `anti-patterns.md`

---

**Remember**: Match fingerprint to actual browser (Chrome fingerprint with Chromium browser)!

```

### apify/cli-workflow.md

```markdown
# Apify CLI Workflow

## Overview

**CRITICAL: Always use the `apify create` command when starting a new Actor.**

This is THE recommended and ONLY proper way to initialize Apify Actors.

## Why apify create is CRITICAL

### ✅ Auto-Generated Files

The `apify create` command generates:

- ✅ `package.json` with correct dependencies and scripts
- ✅ `.actor/actor.json` with proper structure
- ✅ `.actor/input_schema.json` template
- ✅ `Dockerfile` with correct base image
- ✅ `tsconfig.json` (for TypeScript templates)
- ✅ `eslint.config.js` for code quality
- ✅ `.gitignore` with Apify-specific entries
- ✅ `storage/` directory structure
- ✅ `README.md` template
- ✅ Example source code

### ✅ Proper Tooling Setup

Automatically configures:

- ESLint for code quality
- TypeScript compilation (for TS templates)
- npm scripts (`start`, `build`, `test`)
- Apify SDK with correct version
- Crawlee with correct version

### ❌ What Happens Without apify create

Manual setup leads to:

- ❌ Missing ESLint configuration
- ❌ Incorrect dependencies/versions
- ❌ Poor project structure
- ❌ Missing `.actor/` directory
- ❌ Incorrect Dockerfile
- ❌ More debugging time
- ❌ Deployment failures

## Step-by-Step Workflow

### Step 1: Install Apify CLI

```bash
# Check if already installed
apify --version

# If not installed
npm install -g apify-cli

# Verify installation
apify --version
```

### Step 2: Login to Apify

```bash
# Login (required for push/deployment)
apify login

# This opens browser for authentication
```

### Step 3: Create New Actor

```bash
# Create actor
apify create my-scraper

# You'll be prompted:
# → What type of Actor do you want to create?
```

### Step 4: Choose Template

**Choose based on site type** (see `../workflows/productionization.md` for decision tree):

Available TypeScript templates:

| Template | Best For | Speed |
|----------|----------|-------|
| **project_cheerio_crawler_ts** | Static HTML/SSR (RECOMMENDED) | ~10x faster |
| **project_playwright_crawler_ts** | JavaScript-heavy sites | Standard |
| **project_playwright_camoufox_crawler_ts** | Anti-bot challenges | Standard |

**Selection guide**:
- Static HTML → `project_cheerio_crawler_ts` (fastest)
- JavaScript/SPA → `project_playwright_crawler_ts`
- Being blocked → `project_playwright_camoufox_crawler_ts`

```
? What type of Actor do you want to create?
❯ project_cheerio_crawler_ts (TypeScript + Cheerio)
  project_playwright_crawler_ts (TypeScript + Playwright)
  project_playwright_camoufox_crawler_ts (TypeScript + Camoufox)
  project_puppeteer_crawler_ts (TypeScript + Puppeteer)
```

### Step 5: Navigate to Project

```bash
cd my-scraper

# View generated structure
ls -la
```

### Step 6: Review Generated Files

```
my-scraper/
├── .actor/
│   ├── actor.json                 ← Actor configuration
│   └── input_schema.json          ← Input validation
├── src/
│   └── main.ts                    ← Your code here
├── storage/                       ← Local storage
├── .dockerignore
├── .gitignore
├── .prettierrc
├── AGENTS.md                      ← AI agent guidance (Apify-maintained)
├── Dockerfile                     ← Production build
├── eslint.config.js               ← Code quality
├── package.json                   ← Dependencies & scripts
├── tsconfig.json                  ← TypeScript config
└── README.md                      ← Documentation
```

**Important**: The template includes `AGENTS.md`, official Apify documentation for AI agents working with Actors. This file provides:
- Do/Don't patterns for Actor development
- Input/output schema detailed specifications
- Dataset and key-value store schema patterns
- Safety and permission guidelines
- Apify SDK best practices

See `agents-md-guide.md` in this directory for how AGENTS.md complements this skill.

### Step 7: Install Dependencies

```bash
npm install
```

### Step 8: Develop Your Actor

Edit `src/main.ts`:

```typescript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

interface Input {
    startUrls: string[];
}

await Actor.main(async () => {
    const input = await Actor.getInput<Input>();

    const crawler = new PlaywrightCrawler({
        async requestHandler({ page, request }) {
            // Your scraping logic here
        },
    });

    await crawler.run(input?.startUrls ?? []);
});
```

### Step 9: Test Locally

```bash
# Run actor locally
apify run

# With specific input
apify run --input='{"startUrls":[{"url":"https://example.com"}]}'

# Debug mode
DEBUG=crawlee:* apify run
```

### Step 10: Build (TypeScript Only)

```bash
# Compile TypeScript
npm run build

# Output in dist/ directory
```

### Step 11: Push to Apify Platform

```bash
# Deploy to Apify
apify push

# With specific build tag
apify push --build-tag beta

# Force rebuild
apify push --force
```

### Step 12: Call Your Actor

```bash
# Run actor on Apify platform
apify call my-scraper

# With input
apify call my-scraper --input='{"startUrls":[{"url":"https://example.com"}]}'
```

## Complete CLI Command Reference

### Project Management

```bash
# Create new actor
apify create [name]

# Initialize in existing directory
apify init

# Login/logout
apify login
apify logout

# Check login status
apify info
```

### Development

```bash
# Run locally
apify run
apify run --purge           # Clear storage first
apify run --input-file=input.json

# Run specific actor
apify call [actor-id]
apify call [actor-id] --build=beta
```

### Deployment

```bash
# Push to platform
apify push
apify push --build-tag [tag]
apify push --version-number [version]
apify push --wait-for-finish

# Pull actor from platform
apify pull [actor-id]
```

### Storage Management

```bash
# Manage datasets
apify dataset ls
apify dataset get [id]

# Manage key-value stores
apify kv-store ls
apify kv-store get [id]
```

## npm Scripts (Generated by apify create)

The CLI generates these useful scripts:

```json
{
    "scripts": {
        "start": "npm run build && node dist/main.js",
        "build": "tsc",
        "test": "echo \"No tests yet\"",
        "lint": "eslint src",
        "lint:fix": "eslint src --fix"
    }
}
```

Usage:

```bash
npm start          # Build and run
npm run build      # Compile TypeScript
npm test           # Run tests
npm run lint       # Check code quality
npm run lint:fix   # Auto-fix linting issues
```

## Development Workflow

### Typical Development Cycle

```bash
# 1. Create actor
apify create my-scraper
cd my-scraper

# 2. Develop
# Edit src/main.ts

# 3. Test locally
apify run

# 4. Fix issues, repeat step 3

# 5. Lint code
npm run lint:fix

# 6. Push to platform
apify push

# 7. Test on platform
apify call my-scraper

# 8. Iterate
# Edit code, repeat from step 3
```

## Common Issues

### Issue: "Command not found: apify"

**Solution**:
```bash
npm install -g apify-cli
```

### Issue: "Not logged in"

**Solution**:
```bash
apify login
```

### Issue: Build fails

**Solution**:
```bash
# Check TypeScript errors
npm run build

# Fix errors in src/
# Then try again:
apify push
```

## Anti-Pattern: Manual Creation

### ❌ DON'T Do This

```bash
# BAD: Manual setup
mkdir my-actor
cd my-actor
npm init -y
npm install apify crawlee
# ... missing tons of configuration
```

**Why this is wrong**:
- Missing `.actor/` directory
- No input schema
- Incorrect Dockerfile
- No ESLint config
- No TypeScript setup
- Missing npm scripts
- Will fail deployment

### ✅ DO This Instead

```bash
# GOOD: Use CLI
apify create my-actor
cd my-actor
# Everything configured correctly!
```

## Best Practices

### ✅ DO:

- **Always use `apify create`** (not manual setup)
- **Choose appropriate template** based on site type (see decision tree in productionization guide)
- **Test locally first** with `apify run`
- **Use build tags** for staging (`--build-tag beta`)
- **Keep CLI updated** (`npm update -g apify-cli`)
- **Use `.env` file** for local secrets
- **Commit to git** (except storage/, dist/)

### ❌ DON'T:

- **Create actors manually** - use CLI!
- **Skip local testing** - test before push
- **Hardcode secrets** - use environment variables
- **Push without building** (TypeScript actors)
- **Ignore linting errors** - fix them
- **Skip version tags** - use semantic versioning

## Resources

- [Apify CLI Docs](https://docs.apify.com/cli)
- [CLI Reference](https://docs.apify.com/cli/docs/reference)
- [Actor Development](https://docs.apify.com/platform/actors)

## Summary

**The Apify CLI is THE way to create Actors**

**Key commands**:
1. `apify create` - Create new actor (CRITICAL)
2. `apify run` - Test locally
3. `apify push` - Deploy to platform
4. `apify call` - Run on platform

**Remember**: Always use `apify create`, never manual setup!

```

### apify/input-schemas.md

```markdown
# Input Schema Patterns

Patterns for defining Actor input validation in `.actor/input_schema.json`.

## Schema Structure

```json
{
    "title": "Actor Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "fieldName": {
            "title": "Field Label",
            "type": "string",
            "description": "Help text",
            "editor": "textfield"
        }
    },
    "required": ["fieldName"]
}
```

## Common Field Types

### String Field

```json
{
    "url": {
        "title": "URL",
        "type": "string",
        "description": "Website URL to scrape",
        "editor": "textfield",
        "pattern": "https?://.+",
        "example": "https://example.com"
    }
}
```

### Number Field

```json
{
    "maxItems": {
        "title": "Maximum items",
        "type": "integer",
        "description": "Max number of items to scrape",
        "editor": "number",
        "default": 100,
        "minimum": 1,
        "maximum": 10000
    }
}
```

### Boolean Field

```json
{
    "saveHtml": {
        "title": "Save HTML",
        "type": "boolean",
        "description": "Save raw HTML",
        "editor": "checkbox",
        "default": false
    }
}
```

### Array of URLs

```json
{
    "startUrls": {
        "title": "Start URLs",
        "type": "array",
        "description": "List of URLs to scrape",
        "editor": "requestListSources",
        "placeholderValue": [{"url": "https://example.com"}],
        "minItems": 1
    }
}
```

### Select Dropdown

```json
{
    "mode": {
        "title": "Scraping mode",
        "type": "string",
        "description": "Choose scraping strategy",
        "editor": "select",
        "enum": ["fast", "thorough", "balanced"],
        "enumTitles": ["Fast", "Thorough", "Balanced"],
        "default": "balanced"
    }
}
```

### Object Field

```json
{
    "proxyConfiguration": {
        "title": "Proxy configuration",
        "type": "object",
        "description": "Proxy settings",
        "editor": "proxy",
        "default": {"useApifyProxy": true}
    }
}
```

### Text Area

```json
{
    "customJs": {
        "title": "Custom JavaScript",
        "type": "string",
        "description": "Custom page function",
        "editor": "javascript",
        "prefill": "async ({ page }) => {\n    // Your code\n}"
    }
}
```

### Hidden Field

```json
{
    "version": {
        "title": "Version",
        "type": "string",
        "description": "Internal version",
        "editor": "hidden",
        "default": "1.0.0"
    }
}
```

## Complete Examples

### Pattern 1: Basic Scraper

```json
{
    "title": "Basic Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to scrape",
            "editor": "requestListSources",
            "minItems": 1
        },
        "maxItems": {
            "title": "Maximum items",
            "type": "integer",
            "description": "Max results",
            "editor": "number",
            "default": 100,
            "minimum": 1
        }
    },
    "required": ["startUrls"]
}
```

### Pattern 2: E-commerce Scraper

```json
{
    "title": "E-commerce Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Product URLs",
            "type": "array",
            "description": "Product pages to scrape",
            "editor": "requestListSources"
        },
        "maxItems": {
            "title": "Max products",
            "type": "integer",
            "description": "Maximum products to scrape",
            "editor": "number",
            "default": 1000
        },
        "includeReviews": {
            "title": "Include reviews",
            "type": "boolean",
            "description": "Scrape product reviews",
            "editor": "checkbox",
            "default": false
        },
        "minPrice": {
            "title": "Minimum price",
            "type": "number",
            "description": "Filter by minimum price",
            "editor": "number",
            "minimum": 0
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Proxy settings",
            "editor": "proxy"
        }
    },
    "required": ["startUrls"]
}
```

### Pattern 3: Advanced Scraper with Options

```json
{
    "title": "Advanced Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to scrape",
            "editor": "requestListSources"
        },
        "mode": {
            "title": "Scraping mode",
            "type": "string",
            "description": "Choose strategy",
            "editor": "select",
            "enum": ["sitemap", "api", "playwright", "hybrid"],
            "enumTitles": ["Sitemap", "API", "Playwright", "Hybrid"],
            "default": "hybrid"
        },
        "maxConcurrency": {
            "title": "Max concurrency",
            "type": "integer",
            "description": "Parallel requests",
            "editor": "number",
            "default": 5,
            "minimum": 1,
            "maximum": 50
        },
        "maxRequestsPerMinute": {
            "title": "Max requests/min",
            "type": "integer",
            "description": "Rate limit",
            "editor": "number",
            "default": 60
        },
        "useFingerprinting": {
            "title": "Use fingerprinting",
            "type": "boolean",
            "description": "Anti-blocking",
            "editor": "checkbox",
            "default": false
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "description": "Proxy settings",
            "editor": "proxy"
        }
    },
    "required": ["startUrls", "mode"]
}
```

### Pattern 4: API-based Scraper

```json
{
    "title": "API Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "apiUrl": {
            "title": "API URL",
            "type": "string",
            "description": "API endpoint",
            "editor": "textfield",
            "pattern": "https?://.+"
        },
        "apiKey": {
            "title": "API Key",
            "type": "string",
            "description": "Authentication key",
            "editor": "textfield",
            "isSecret": true
        },
        "pageSize": {
            "title": "Page size",
            "type": "integer",
            "description": "Items per page",
            "editor": "number",
            "default": 100
        },
        "maxPages": {
            "title": "Max pages",
            "type": "integer",
            "description": "Maximum pages to fetch",
            "editor": "number",
            "default": 10
        }
    },
    "required": ["apiUrl"]
}
```

### Pattern 5: Sitemap + Playwright

```json
{
    "title": "Sitemap Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "sitemapUrl": {
            "title": "Sitemap URL",
            "type": "string",
            "description": "URL to sitemap.xml",
            "editor": "textfield",
            "example": "https://example.com/sitemap.xml"
        },
        "urlPattern": {
            "title": "URL pattern (regex)",
            "type": "string",
            "description": "Filter URLs by regex",
            "editor": "textfield",
            "example": "/products/.*"
        },
        "maxItems": {
            "title": "Maximum items",
            "type": "integer",
            "description": "Max URLs to scrape",
            "editor": "number",
            "default": 1000
        },
        "proxyConfiguration": {
            "title": "Proxy configuration",
            "type": "object",
            "editor": "proxy"
        }
    },
    "required": ["sitemapUrl"]
}
```

### Pattern 6: With Custom Fields

```json
{
    "title": "Custom Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources"
        },
        "selectors": {
            "title": "Custom selectors",
            "type": "object",
            "description": "CSS selectors for data",
            "editor": "json",
            "prefill": "{\n  \"title\": \"h1\",\n  \"price\": \".price\"\n}"
        },
        "customFunction": {
            "title": "Custom function",
            "type": "string",
            "description": "Custom extraction logic",
            "editor": "javascript",
            "prefill": "async ({ page }) => {\n    return { title: await page.title() };\n}"
        }
    },
    "required": ["startUrls"]
}
```

## Field Editors

Available editor types:

| Editor | Use For | Type |
|--------|---------|------|
| `textfield` | Short text | string |
| `textarea` | Long text | string |
| `number` | Numbers | integer/number |
| `checkbox` | Boolean | boolean |
| `select` | Dropdown | string |
| `json` | JSON object | object |
| `javascript` | Code | string |
| `proxy` | Proxy config | object |
| `requestListSources` | URL arrays | array |
| `hidden` | Hidden field | any |

## Validation Patterns

### URL Validation

```json
{
    "pattern": "^https?://.*",
    "example": "https://example.com"
}
```

### Email Validation

```json
{
    "pattern": "^[^@]+@[^@]+\\.[^@]+$",
    "example": "[email protected]"
}
```

### Number Range

```json
{
    "minimum": 1,
    "maximum": 1000,
    "default": 100
}
```

### Required Array

```json
{
    "type": "array",
    "minItems": 1
}
```

### Secret Field

```json
{
    "isSecret": true,
    "editor": "textfield"
}
```

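The `pattern` values above are ordinary ECMAScript regular expressions, so they can be sanity-checked locally before publishing a schema:

```javascript
// Sanity-check the schema `pattern` values above with plain JS regexes.
const urlPattern = /^https?:\/\/.*/;
const emailPattern = /^[^@]+@[^@]+\.[^@]+$/;

console.log(urlPattern.test('https://example.com')); // true
console.log(urlPattern.test('not-a-url'));           // false
console.log(emailPattern.test('user@example.com'));  // true
console.log(emailPattern.test('user@@example'));     // false
```

Note that the URL pattern only checks the scheme prefix; tighten it as needed for stricter fields.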
## TypeScript Usage

```typescript
// Define matching interface
interface Input {
    startUrls: { url: string }[];
    maxItems?: number;
    proxyConfiguration?: object;
}

// Use in Actor
await Actor.main(async () => {
    const input = await Actor.getInput<Input>();

    if (!input?.startUrls) {
        throw new Error('startUrls is required');
    }
});
```

## Best Practices

### ✅ DO:
- Provide clear `description` for each field
- Set sensible `default` values
- Use appropriate `editor` types
- Add `example` values
- Validate with `pattern`, `minimum`, `maximum`
- Mark secrets with `isSecret: true`

### ❌ DON'T:
- Don't use `any` type
- Don't skip descriptions
- Don't hardcode large defaults
- Don't forget `required` fields
- Don't expose secrets in prefill

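As a complement to platform-side validation, a small runtime check can apply `default` values and enforce `required` fields locally. This is a minimal sketch; `applyDefaults` is an illustrative helper, not part of the Apify SDK:

```javascript
// Illustrative only: the Apify platform validates input against the schema
// for you, but a local check like this can catch mistakes early.
function applyDefaults(schema, input) {
    const result = { ...input };
    // Fill in defaults for fields the caller omitted
    for (const [name, field] of Object.entries(schema.properties)) {
        if (result[name] === undefined && field.default !== undefined) {
            result[name] = field.default;
        }
    }
    // Enforce required fields
    for (const name of schema.required ?? []) {
        if (result[name] === undefined) {
            throw new Error(`Missing required input field: ${name}`);
        }
    }
    return result;
}

const schema = {
    properties: {
        startUrls: { type: 'array' },
        maxItems: { type: 'integer', default: 100 },
    },
    required: ['startUrls'],
};

const input = applyDefaults(schema, { startUrls: [{ url: 'https://example.com' }] });
console.log(input.maxItems); // 100
```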
## Resources

- [Input Schema Docs](https://docs.apify.com/platform/actors/development/actor-definition/input-schema)
- [JSON Schema Spec](https://json-schema.org/)
- [Apify Editor Types](https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1#editor-types)

```

### apify/deployment.md

```markdown
# Actor Deployment Patterns

Testing and deployment workflows for Apify Actors.

## Local Testing

### Basic Run

```bash
# Run with default input
apify run

# Output shows:
# - Actor initialization
# - Logs
# - Results saved to ./storage/datasets/default/
```

### With Custom Input

```bash
# Inline JSON
apify run --input='{"startUrls":[{"url":"https://example.com"}]}'

# From file
apify run --input-file=./test-input.json

# Different input file
apify run --input-file=./inputs/production.json
```

### Clean Run

```bash
# Purge storage before running
apify run --purge

# Fresh start, no cached data
```

### Debug Mode

```bash
# Enable Crawlee debug logging
CRAWLEE_LOG_LEVEL=DEBUG apify run

# Apify SDK debug logging
APIFY_LOG_LEVEL=DEBUG apify run

# Playwright API debug
DEBUG=pw:api apify run
```

## TypeScript Build

### Build Before Run

```bash
# Compile TypeScript
npm run build

# Output: dist/main.js

# Then run
npm start
# Or
node dist/main.js
```

### Watch Mode (Development)

```bash
# Auto-rebuild on changes
npm run build -- --watch

# In another terminal
npm start
```

### Build Errors

```bash
# Check TypeScript errors
npm run build

# Fix errors, then retry
# Common issues:
# - Type mismatches
# - Missing imports
# - Syntax errors
```

## Deployment to Platform

### First Deployment

```bash
# Push to Apify platform
apify push

# Process:
# 1. Uploads source code
# 2. Builds Docker image
# 3. Creates new Actor version
# 4. Sets as latest
```

### With Build Tag

```bash
# Deploy to specific tag
apify push --build-tag beta

# Deploy to dev
apify push --build-tag dev

# Production release
apify push --build-tag latest
```

### With Version

```bash
# Set version number
apify push --version-number 1.2.3

# Updates .actor/actor.json version field
```

### Wait for Build

```bash
# Wait until build completes
apify push --wait-for-finish

# Useful in CI/CD pipelines
```

### Force Rebuild

```bash
# Force rebuild even if no changes
apify push --force

# Use when:
# - Dependencies updated
# - Dockerfile changed
# - Build cache issues
```

## Testing on Platform

### Run Actor

```bash
# Run latest version
apify call my-actor

# Run specific build
apify call my-actor --build=beta

# With input
apify call my-actor --input='{"maxItems":10}'

# With input file
apify call my-actor --input-file=./input.json
```

### Monitor Run

```bash
# Run and wait until the run finishes
apify call my-actor --wait-for-finish

# Shows:
# - Run ID
# - Status
# - Duration
# - Results
```

## Version Management

### Semantic Versioning

```bash
# Major version (breaking changes)
apify push --version-number 2.0.0

# Minor version (new features)
apify push --version-number 1.1.0

# Patch version (bug fixes)
apify push --version-number 1.0.1
```

### Build Tags

```bash
# Development
apify push --build-tag dev

# Staging/testing
apify push --build-tag beta

# Production
apify push --build-tag latest
apify push --build-tag v1.0.0
```

### Tag Strategy

```
main branch    → --build-tag latest
develop branch → --build-tag dev
release/beta   → --build-tag beta
feature/*      → --build-tag feature-name
```

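For CI scripts, the mapping above can be captured in a small helper; the function name and the exact tag derived from feature branches are illustrative:

```javascript
// Illustrative helper encoding the branch -> build-tag strategy above.
function buildTagForBranch(branch) {
    if (branch === 'main') return 'latest';
    if (branch === 'develop') return 'dev';
    if (branch.startsWith('release/')) return 'beta';
    if (branch.startsWith('feature/')) {
        // e.g. feature/new-parser -> new-parser
        return branch.slice('feature/'.length);
    }
    return 'dev'; // conservative default for anything else
}

console.log(buildTagForBranch('main'));               // latest
console.log(buildTagForBranch('feature/new-parser')); // new-parser
```

A CI job could then pass the computed value to `apify push --build-tag`.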
## Complete Workflow Patterns

### Pattern 1: Development Cycle

```bash
# 1. Make changes
vim src/main.ts

# 2. Build
npm run build

# 3. Test locally
apify run --purge

# 4. Fix issues, repeat 2-3

# 5. Lint code
npm run lint:fix

# 6. Deploy to dev
apify push --build-tag dev

# 7. Test on platform
apify call my-actor --build=dev

# 8. Deploy to production
apify push --build-tag latest --version-number 1.0.1
```

### Pattern 2: Quick Test

```bash
# Quick test without build
apify run --input='{"startUrls":[{"url":"https://example.com"}],"maxItems":5}'

# Check ./storage/datasets/default/
cat storage/datasets/default/*.json
```

### Pattern 3: CI/CD Deployment

```bash
#!/bin/bash
# deploy.sh

# Build TypeScript
npm run build

# Run tests
npm test

# Lint
npm run lint

# Push to platform
apify push --build-tag ${BUILD_TAG} --wait-for-finish

# Test deployment
apify call ${ACTOR_ID} --build=${BUILD_TAG} --wait-for-finish
```

### Pattern 4: Staged Release

```bash
# 1. Deploy to beta
apify push --build-tag beta --version-number 1.1.0

# 2. Test beta
apify call my-actor --build=beta

# 3. Monitor for issues
# ... wait 24 hours ...

# 4. Promote to production
apify push --build-tag latest --version-number 1.1.0
```

## Storage Inspection

### View Results

```bash
# Local datasets
ls storage/datasets/default/
cat storage/datasets/default/000000001.json

# Pretty print JSON
cat storage/datasets/default/*.json | jq '.'
```

### Key-Value Store

```bash
# View KV store
ls storage/key_value_stores/default/
cat storage/key_value_stores/default/INPUT.json
```

### Request Queue

```bash
# View queue
ls storage/request_queues/default/
```

## Troubleshooting Deployment

### Build Fails

```bash
# Check build log
apify push

# Common issues:
# - TypeScript errors → run npm run build locally
# - Missing dependencies → check package.json
# - Dockerfile errors → test docker build locally
```

### Actor Won't Start

```bash
# Check logs in Apify Console
# Or via CLI:
apify call my-actor --wait-for-finish

# Common issues:
# - Memory too low → increase in actor.json
# - Timeout → increase timeoutSecs
# - Missing environment variables
```

### Build Too Slow

```bash
# Use a lighter base image
# In Dockerfile:
FROM apify/actor-node-playwright-chrome:20

# Skip dev and optional dependencies
RUN npm install --omit=dev --omit=optional
```

### Deployment Fails

```bash
# Check auth
apify info

# Re-login if needed
apify logout
apify login

# Retry with force
apify push --force
```

## Platform Commands

### View Datasets

```bash
# List datasets
apify dataset ls

# Get dataset
apify dataset get <dataset-id>

# Download CSV
apify dataset get <dataset-id> --format csv > data.csv
```

### View Runs

```bash
# List recent runs
apify actor calls my-actor

# Get specific run
apify run get <run-id>

# Abort run
apify run abort <run-id>
```

### Manage Actor

```bash
# Get actor info
apify actor get my-actor

# Update actor
apify push

# Delete actor (careful!)
# Must be done via Console
```

## Best Practices

### ✅ DO:
- Test locally before pushing
- Use semantic versioning
- Tag dev/beta/latest appropriately
- Run `npm run build` before testing TypeScript
- Use `--purge` for clean tests
- Wait for build to complete in CI/CD
- Monitor first runs after deployment

### ❌ DON'T:
- Don't push untested code
- Don't skip version numbers
- Don't use `--force` unnecessarily
- Don't deploy directly to `latest` without testing
- Don't ignore build warnings
- Don't commit secrets to git

## Quick Reference

```bash
# Local development
apify run                        # Run locally
apify run --purge               # Clean run
apify run --input-file=input.json # Custom input
npm run build                    # Build TypeScript
npm start                        # Run built code

# Deployment
apify push                       # Deploy
apify push --build-tag beta     # Deploy to beta
apify push --version-number 1.0.0 # Set version
apify push --wait-for-finish    # Wait for build

# Testing
apify call my-actor             # Run on platform
apify call my-actor --build=beta # Run specific build

# Inspection
apify dataset ls                # List datasets
apify dataset get <id>          # Get dataset
```

## Resources

- [Deployment Docs](https://docs.apify.com/platform/actors/development/deployment)
- [Build Process](https://docs.apify.com/platform/actors/development/builds-and-runs/builds)
- [CLI Reference](https://docs.apify.com/cli/docs/reference)

```

### examples/sitemap-basic.js

```javascript
/**
 * Basic Sitemap-Based Scraper
 *
 * This example shows how to:
 * 1. Automatically discover sitemaps using RobotsFile
 * 2. Get all URLs from sitemaps
 * 3. Scrape pages using Playwright
 *
 * Use this pattern for: E-commerce sites, blogs, news sites with sitemaps
 */

import { PlaywrightCrawler, RobotsFile, Dataset } from 'crawlee';

async function main() {
    const baseUrl = 'https://example.com';

    console.log(`🔍 Discovering sitemaps for ${baseUrl}...`);

    // Step 1: Automatically find and parse all sitemaps
    const robots = await RobotsFile.find(baseUrl);
    const urls = await robots.parseUrlsFromSitemaps();

    console.log(`✓ Found ${urls.length} URLs from sitemaps`);

    // Optional: Filter URLs (e.g., only product pages)
    const productUrls = urls.filter(url => url.includes('/products/'));
    console.log(`✓ Filtered to ${productUrls.length} product URLs`);

    // Step 2: Create crawler
    const crawler = new PlaywrightCrawler({
        maxConcurrency: 5,
        maxRequestsPerMinute: 60,

        async requestHandler({ page, request, log }) {
            log.info(`Scraping: ${request.url}`);

            // Wait for the main content to load
            await page.waitForSelector('h1');

            // Extract data
            const data = await page.evaluate(() => ({
                title: document.querySelector('h1')?.textContent?.trim(),
                price: document.querySelector('.price')?.textContent?.trim(),
                description: document.querySelector('.description')?.textContent?.trim(),
                image: document.querySelector('img.main-image')?.src,
                inStock: document.querySelector('.in-stock') !== null,
            }));

            // Save to dataset
            await Dataset.pushData({
                url: request.url,
                ...data,
                scrapedAt: new Date().toISOString(),
            });
        },

        failedRequestHandler({ request, log }, error) {
            log.error(`Failed to scrape ${request.url}: ${error.message}`);
        },
    });

    // Step 3: Add URLs and run
    await crawler.addRequests(productUrls.slice(0, 10)); // Test with first 10
    await crawler.run();

    console.log('✓ Scraping completed');
}

main();

```

### examples/api-scraper.js

```javascript
/**
 * API-Based Scraper
 *
 * This example shows how to:
 * 1. Use APIs instead of scraping HTML
 * 2. Handle authentication (cookies, tokens)
 * 3. Process JSON responses
 *
 * Use this pattern for: Any site with a discoverable API
 */

import { gotScraping } from 'got-scraping';
import { setTimeout } from 'timers/promises';

async function main() {
    // Example: Scrape products via API
    const baseApiUrl = 'https://api.example.com/v1';
    const productIds = [123, 456, 789]; // Get these from sitemap or exploration

    const results = [];

    console.log(`🔍 Fetching ${productIds.length} products via API...`);

    for (const id of productIds) {
        try {
            console.log(`Fetching product ${id}...`);

            const response = await gotScraping({
                url: `${baseApiUrl}/products/${id}`,
                responseType: 'json',
                headers: {
                    'User-Agent': 'Mozilla/5.0 (compatible; Scraper/1.0)',
                    // Add authentication if needed:
                    // 'Authorization': 'Bearer YOUR_TOKEN',
                    // 'X-API-Key': 'YOUR_API_KEY',
                },
                timeout: {
                    request: 10000, // 10 second timeout
                },
                retry: {
                    limit: 3,
                    methods: ['GET'],
                },
            });

            // API returns clean JSON
            const product = response.body;

            results.push({
                id: product.id,
                name: product.name,
                price: product.price,
                inStock: product.in_stock,
                scrapedAt: new Date().toISOString(),
            });

            console.log(`✓ Fetched: ${product.name}`);

            // Rate limiting (respect API limits)
            await setTimeout(100); // 100ms delay = 10 requests/second

        } catch (error) {
            if (error.response?.statusCode === 404) {
                console.log(`✗ Product ${id} not found`);
            } else if (error.response?.statusCode === 429) {
                console.log(`⚠ Rate limited, waiting 5 seconds...`);
                await setTimeout(5000);
                // In a real scraper, re-queue this id so it gets retried
            } else {
                console.error(`✗ Error fetching product ${id}:`, error.message);
            }
        }
    }

    console.log(`✓ Fetched ${results.length}/${productIds.length} products`);
    console.log(JSON.stringify(results, null, 2));
}

main();

```

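The 429 branch above pauses but does not actually re-fetch the product. A generic retry wrapper with exponential backoff can close that gap; this is a sketch, and `withRetry` is an illustrative name rather than a got-scraping feature:

```javascript
import { setTimeout as sleep } from 'timers/promises';

// Retry an async operation with exponential backoff.
// fetchFn stands in for any call, e.g. the gotScraping request above.
async function withRetry(fetchFn, { retries = 3, baseDelayMs = 500 } = {}) {
    let lastError;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            return await fetchFn();
        } catch (error) {
            lastError = error;
            const delay = baseDelayMs * 2 ** attempt; // 500, 1000, 2000, ...
            if (attempt < retries && delay > 0) await sleep(delay);
        }
    }
    throw lastError;
}

// Demo: simulate two rate-limit failures, then success.
let attempts = 0;
const result = await withRetry(async () => {
    attempts += 1;
    if (attempts < 3) throw new Error('HTTP 429');
    return { ok: true };
}, { retries: 3, baseDelayMs: 10 });

console.log(attempts, result.ok); // 3 true
```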
### examples/hybrid-sitemap-api.js

```javascript
/**
 * Hybrid: Sitemap + API Scraper
 *
 * This example shows how to:
 * 1. Get all URLs from sitemap (instant discovery)
 * 2. Extract IDs from URLs
 * 3. Fetch data via API (clean JSON)
 *
 * Use this pattern for: Best performance + data quality
 * Performance: often an order of magnitude faster than full crawling, and more reliable than HTML scraping
 */

import { RobotsFile } from 'crawlee';
import { gotScraping } from 'got-scraping';
import { setTimeout } from 'timers/promises';

async function main() {
    const baseUrl = 'https://shop.example.com';

    console.log('🔍 Phase 1: Sitemap Discovery');

    // Step 1: Get all URLs from sitemap (instant!)
    const robots = await RobotsFile.find(baseUrl);
    const urls = await robots.parseUrlsFromSitemaps();

    console.log(`✓ Found ${urls.length} URLs from sitemap`);

    // Step 2: Extract product IDs from URLs
    const productIds = urls
        .map(url => {
            // Extract ID from URL pattern: /products/123
            const match = url.match(/\/products\/(\d+)/);
            return match ? match[1] : null;
        })
        .filter(Boolean); // Remove nulls

    console.log(`✓ Extracted ${productIds.length} product IDs`);

    console.log('🔍 Phase 2: API Data Fetching');

    // Step 3: Fetch data via API (much faster than scraping HTML!)
    const results = [];

    for (const id of productIds.slice(0, 50)) { // Limit to 50 for demo
        try {
            const response = await gotScraping({
                url: `https://api.example.com/v1/products/${id}`,
                responseType: 'json',
                headers: {
                    'User-Agent': 'Mozilla/5.0...',
                },
                timeout: {
                    request: 10000,
                },
            });

            results.push({
                id: response.body.id,
                name: response.body.name,
                price: response.body.price,
                url: `${baseUrl}/products/${id}`,
                scrapedAt: new Date().toISOString(),
            });

            if (results.length % 10 === 0) {
                console.log(`✓ Fetched ${results.length}/${productIds.length} products`);
            }

            // Rate limiting
            await setTimeout(50); // 20 requests/second

        } catch (error) {
            console.error(`✗ Failed to fetch product ${id}:`, error.message);
        }
    }

    console.log(`✓ Completed: ${results.length} products`);
    console.log('Sample result:', results[0]);

    // Save results (in real scenario)
    // await fs.writeFile('products.json', JSON.stringify(results, null, 2));
}

main();

```

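The Phase 1 ID-extraction step above is easy to pull out and check in isolation; it assumes the same `/products/<numeric-id>` URL shape:

```javascript
// Isolated version of the Phase 1 ID-extraction step above.
function extractProductIds(urls) {
    return urls
        .map(url => url.match(/\/products\/(\d+)/)?.[1])
        .filter(Boolean); // drop URLs that did not match
}

const ids = extractProductIds([
    'https://shop.example.com/products/123',
    'https://shop.example.com/about',
    'https://shop.example.com/products/456?ref=sitemap',
]);

console.log(ids); // [ '123', '456' ]
```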
### examples/playwright-basic.js

```javascript
/**
 * Basic Playwright Scraper
 *
 * This example shows how to:
 * 1. Scrape JavaScript-rendered content
 * 2. Extract structured data with page.evaluate()
 * 3. Rely on Playwright's auto-waiting
 *
 * Use this pattern for: JavaScript-heavy sites (React, Vue, Angular)
 */

import { PlaywrightCrawler, Dataset } from 'crawlee';

async function main() {
    const crawler = new PlaywrightCrawler({
        // Run 3 browsers in parallel
        maxConcurrency: 3,

        // Limit requests per minute
        maxRequestsPerMinute: 30,

        async requestHandler({ page, request, log, enqueueLinks }) {
            log.info(`Scraping: ${request.url}`);

            // Wait for content to load (automatic waiting)
            await page.waitForSelector('h1');

            // Extract data using page.evaluate()
            const data = await page.evaluate(() => {
                return {
                    title: document.querySelector('h1')?.textContent?.trim(),
                    price: document.querySelector('.price')?.textContent?.trim(),
                    description: document.querySelector('.description')?.textContent?.trim(),

                    // Extract multiple items
                    features: Array.from(document.querySelectorAll('.feature')).map(el => ({
                        name: el.querySelector('.name')?.textContent?.trim(),
                        value: el.querySelector('.value')?.textContent?.trim(),
                    })),

                    // Extract images
                    images: Array.from(document.querySelectorAll('img.product-image'))
                        .map(img => img.src),
                };
            });

            // Save to dataset
            await Dataset.pushData({
                url: request.url,
                ...data,
                scrapedAt: new Date().toISOString(),
            });

            // Optional: Enqueue links to other pages
            await enqueueLinks({
                selector: 'a.related-product',
                strategy: 'same-domain',
            });
        },

        failedRequestHandler({ request, log }, error) {
            log.error(`Request failed: ${request.url} - ${error.message}`);
        },
    });

    // Start URLs
    await crawler.run([
        'https://example.com/product/1',
        'https://example.com/product/2',
        'https://example.com/product/3',
    ]);

    console.log('✓ Scraping completed');
}

main();

```

### examples/iterative-fallback.js

```javascript
/**
 * Iterative Fallback Scraper
 *
 * This example shows how to:
 * 1. Try simplest approach first (Sitemap + API)
 * 2. Automatically fallback if it fails
 * 3. End with most complex (Playwright crawling)
 *
 * Use this pattern for: Unknown sites, maximum reliability
 */

import { RobotsFile, PlaywrightCrawler, Dataset } from 'crawlee';
import { gotScraping } from 'got-scraping';

async function scrapeWithFallback(baseUrl) {
    console.log(`🔍 Starting intelligent scraping for ${baseUrl}`);

    // ============================================
    // Attempt 1: Sitemap + API (FASTEST)
    // ============================================
    try {
        console.log('\n📋 Attempt 1: Sitemap + API');

        // Get URLs from sitemap
        const robots = await RobotsFile.find(baseUrl);
        const urls = await robots.parseUrlsFromSitemaps();

        if (urls.length === 0) {
            throw new Error('No URLs found in sitemap');
        }

        console.log(`✓ Found ${urls.length} URLs in sitemap`);

        // Extract IDs
        const ids = urls
            .map(url => url.match(/\/products\/(\d+)/)?.[1])
            .filter(Boolean)
            .slice(0, 5); // Test with 5

        console.log(`✓ Extracted ${ids.length} product IDs`);

        // Try API (hypothetical endpoint derived from the site domain)
        console.log('Testing API...');
        const apiUrl = `https://api.${baseUrl.replace('https://', '')}/products/${ids[0]}`;

        const testResponse = await gotScraping({
            url: apiUrl,
            responseType: 'json',
            timeout: { request: 5000 },
        });

        console.log('✓ API works! Using Sitemap + API approach');

        // Fetch all data via API
        const results = [];
        for (const id of ids) {
            const response = await gotScraping({
                url: `https://api.${baseUrl.replace('https://', '')}/products/${id}`,
                responseType: 'json',
            });
            results.push(response.body);
        }

        console.log(`✅ Success with Sitemap + API: ${results.length} products`);
        return { method: 'sitemap-api', data: results };

    } catch (error) {
        console.log(`✗ Sitemap + API failed: ${error.message}`);
    }

    // ============================================
    // Attempt 2: Sitemap + Playwright
    // ============================================
    try {
        console.log('\n📋 Attempt 2: Sitemap + Playwright');

        const robots = await RobotsFile.find(baseUrl);
        const urls = await robots.parseUrlsFromSitemaps();

        if (urls.length === 0) {
            throw new Error('No URLs found in sitemap');
        }

        console.log(`✓ Found ${urls.length} URLs in sitemap`);

        const crawler = new PlaywrightCrawler({
            maxConcurrency: 3,
            async requestHandler({ page, request }) {
                const data = await page.evaluate(() => ({
                    title: document.querySelector('h1')?.textContent,
                    price: document.querySelector('.price')?.textContent,
                }));

                await Dataset.pushData({ url: request.url, ...data });
            },
        });

        await crawler.addRequests(urls.slice(0, 5)); // Test with 5
        await crawler.run();

        const results = await Dataset.getData();
        console.log(`✅ Success with Sitemap + Playwright: ${results.items.length} products`);
        return { method: 'sitemap-playwright', data: results.items };

    } catch (error) {
        console.log(`✗ Sitemap + Playwright failed: ${error.message}`);
    }

    // ============================================
    // Attempt 3: Pure Playwright Crawling (FALLBACK)
    // ============================================
    try {
        console.log('\n📋 Attempt 3: Playwright Crawling (fallback)');

        const crawler = new PlaywrightCrawler({
            maxRequestsPerCrawl: 10,
            async requestHandler({ page, request, enqueueLinks }) {
                const data = await page.evaluate(() => ({
                    title: document.querySelector('h1')?.textContent,
                    price: document.querySelector('.price')?.textContent,
                }));

                await Dataset.pushData({ url: request.url, ...data });

                // Discover more pages by following same-domain product links
                await enqueueLinks({
                    selector: 'a[href*="/products/"]',
                    strategy: 'same-domain',
                });
            },
        });

        await crawler.run([baseUrl]);

        // Note: the default dataset persists across attempts, so this count may
        // include items pushed by an earlier, partially successful attempt.
        const results = await Dataset.getData();
        console.log(`✅ Success with Playwright Crawling: ${results.items.length} products`);
        return { method: 'playwright-crawl', data: results.items };

    } catch (error) {
        console.log(`✗ Playwright Crawling failed: ${error.message}`);
    }

    // ============================================
    // All attempts failed
    // ============================================
    console.log('\n❌ All scraping methods failed');
    throw new Error('Unable to scrape site with any method');
}

// Usage
async function main() {
    try {
        const result = await scrapeWithFallback('https://example.com');
        console.log(`\n✅ Final result: Used ${result.method}, got ${result.data.length} items`);
    } catch (error) {
        console.error(`❌ Scraping failed: ${error.message}`);
    }
}

main();

```
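The three-attempt cascade above repeats the same try/catch/log scaffolding for each strategy. As a minimal sketch, that sequential-fallback pattern can be factored into a generic helper, decoupled from Crawlee so each strategy can be tested in isolation. The `{ name, run }` shape is an assumption of this sketch, not part of the Crawlee API:

```javascript
// Try each scraping method in order; return the first that succeeds.
// Each attempt is { name, run }, where run() resolves to the scraped items.
async function firstSuccessful(attempts) {
    const errors = [];
    for (const { name, run } of attempts) {
        try {
            return { method: name, data: await run() };
        } catch (error) {
            errors.push(`${name}: ${error.message}`);
        }
    }
    throw new Error(`All scraping methods failed:\n${errors.join('\n')}`);
}
```

With a helper like this, `scrapeWithFallback` reduces to building the attempts array in priority order (sitemap + API, sitemap + Playwright, pure Playwright).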

### reference/selector-guide.md

```markdown
# Playwright Selector Guide

Quick reference for stable, reliable selectors.

## Priority Order (Most Stable → Least Stable)

### 1. Role-Based (BEST)

```javascript
page.getByRole('button', { name: 'Add to cart' })
page.getByRole('heading', { level: 1 })
page.getByRole('link', { name: 'Next page' })
page.getByRole('textbox', { name: 'Email' })
```

### 2. Test IDs

```javascript
page.getByTestId('product-price')
page.getByTestId('add-to-cart-button')
```

### 3. Labels (Forms)

```javascript
page.getByLabel('Email')
page.getByLabel('Password')
```

### 4. Text Content

```javascript
page.getByText('Sign in')
page.getByText('Add to cart')
```

### 5. CSS/XPath (LAST RESORT)

```javascript
page.locator('.product-price')
page.locator('xpath=//div[@class="content"]')
```
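The priority order above can be encoded as a small decision helper. This is purely illustrative: the input fields (`role`, `testId`, `label`, `text`, `css`) are hypothetical descriptors of what you know about a target element, not Playwright APIs.

```javascript
// Given what is known about a target element, pick the most stable
// locator strategy available, following the priority order above.
function pickStrategy(el) {
    if (el.role) return { by: 'role', value: el.role };
    if (el.testId) return { by: 'testId', value: el.testId };
    if (el.label) return { by: 'label', value: el.label };
    if (el.text) return { by: 'text', value: el.text };
    return { by: 'css', value: el.css }; // last resort
}
```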

## Common Roles

```text
button, link, textbox, checkbox, radio, combobox, 
listbox, menu, menuitem, tab, tabpanel, dialog,
heading, img, list, listitem, table, row, cell
```

## Chaining Selectors

```javascript
// Find button within a specific section
page.locator('.checkout-section').getByRole('button', { name: 'Pay' })

// Find specific list item
page.getByRole('list').getByRole('listitem').filter({ hasText: 'Apple' })
```

```
