
website-to-vite-scraper

A multi-provider website scraper that converts dynamic sites to static deployments. Combines Playwright, Apify, and Firecrawl to handle different site types, with built-in asset downloading and Cloudflare Pages deployment. Useful for cloning, reverse-engineering, or archiving websites.

Packaged view

This page reorganizes the original catalog entry to foreground fit, installability, and workflow context. The original raw source appears below.

Stars: 4
Hot score: 81
Updated: March 20, 2026
Overall rating: A (7.7)
Composite score: 5.1
Best-practice grade: B (81.2)

Install command

npx @skill-hub/cli install breverdbidder-life-os-website-to-vite-scraper
Tags: web-scraping, static-site-generation, playwright, automation, cloudflare

Repository

breverdbidder/life-os

Skill path: skills/website-to-vite-scraper


Open repository

Best for

Primary workflow: Write Technical Docs.

Technical facets: Full Stack, DevOps, Tech Writer, Testing.

Target audience: Developers needing to clone websites for analysis, create static versions of dynamic sites, or archive web content.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: breverdbidder.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install website-to-vite-scraper into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/breverdbidder/life-os before adding website-to-vite-scraper to shared team environments
  • Use website-to-vite-scraper for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: website-to-vite-scraper
description: Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.
version: "2.0"
---

# Website-to-Vite Scraper V2

Multi-provider website scraper with AI-powered extraction for any website type.

## Scraping Methods

| Method | Best For | Anti-Bot | JS Rendering | Cost |
|--------|----------|----------|--------------|------|
| **Playwright** | General sites, Next.js/React apps | ❌ | ✅ Full | FREE |
| **Apify RAG Browser** | LLM/RAG-optimized content | ✅ | ✅ Adaptive | Credits |
| **Crawl4AI** | AI training data, clean extraction | ✅ | ✅ | Credits |
| **Firecrawl** | Protected sites, anti-bot bypass | ✅✅ | ✅ | $16/mo |

## Quick Start

### GitHub Actions (Recommended)

```bash
# Go to: Actions → Website Scraper V2 → Run workflow
# Options:
#   - URL: https://www.reventure.app/
#   - Project name: reventure-clone
#   - Method: all (tries all providers)
#   - Deploy: true
```
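
The workflow can also be triggered programmatically via the GitHub REST API's `workflow_dispatch` endpoint. A minimal sketch of building the request body, assuming the input names mirror the workflow options above (the exact input names and branch are assumptions, not confirmed by the repository):

```python
# Hedged sketch: build the JSON body for
# POST /repos/{owner}/{repo}/actions/workflows/{workflow_id}/dispatches
import json

def build_dispatch_payload(url: str, project_name: str,
                           method: str = "all", deploy: bool = True) -> str:
    """Return the JSON request body for a workflow_dispatch trigger."""
    return json.dumps({
        "ref": "main",  # assumed default branch
        "inputs": {
            "url": url,
            "project_name": project_name,
            "scrape_method": method,
            # workflow_dispatch inputs are passed as strings
            "deploy_cloudflare": str(deploy).lower(),
        },
    })

payload = build_dispatch_payload("https://www.reventure.app/", "reventure-clone")
```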

### API MEGA LIBRARY Integration

The following APIs from our library enhance this scraper:

| API | Purpose | Status |
|-----|---------|--------|
| `APIFY_API_TOKEN` | RAG Browser, Crawl4AI, Web Scraper | ✅ Configured |
| `FIRECRAWL_API_KEY` | Anti-bot bypass, stealth mode | ✅ Configured |
| `BROWSERLESS_API_KEY` | Alternative headless browser | 🔄 Available |

### MCP Server Integration

Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["@apify/actors-mcp-server"],
      "env": {
        "APIFY_TOKEN": "your-apify-api-token"
      }
    }
  }
}
```

Or use hosted: `https://mcp.apify.com?token=YOUR_TOKEN`

## Apify Actors Used

### apify/rag-web-browser
- **Purpose:** LLM-optimized web content extraction
- **Output:** Markdown, HTML, text
- **Features:** 
  - Playwright adaptive (handles JS)
  - Clean content extraction
  - Link following
  - Metadata extraction

### raizen/ai-web-scraper (Crawl4AI)
- **Purpose:** AI training data collection
- **Output:** Cleaned markdown, structured links
- **Features:**
  - Excludes boilerplate (headers, footers, nav)
  - Word count thresholding
  - External link filtering

### Firecrawl
- **Purpose:** Anti-bot protected sites
- **Output:** Markdown, HTML, screenshots
- **Features:**
  - Anti-detection technology
  - JavaScript rendering
  - Main content extraction
  - 5-second wait for dynamic content

## Output Structure

```
project-name/
├── dist/
│   ├── index.html      # Best merged HTML
│   ├── screenshot.png  # Full page capture
│   ├── meta.json       # Scrape metadata
│   └── assets/
│       ├── images/     # Downloaded images
│       ├── css/        # Stylesheets
│       └── js/         # Scripts
└── results/
    ├── playwright/     # Raw Playwright output
    ├── apify-rag/      # RAG Browser output
    ├── crawl4ai/       # Crawl4AI output
    └── firecrawl/      # Firecrawl output
```

## Handling CSR/SPA Sites

Sites built with frameworks like Next.js, React, or Vue that render client-side require JavaScript execution:

1. **Playwright** waits for `networkidle` + 5 seconds
2. **Apify RAG** uses adaptive crawler (Playwright when needed)
3. **Firecrawl** has built-in JS rendering

For `__NEXT_DATA__` extraction (Next.js sites):
- Playwright automatically extracts and saves to `next_data.json`
- Can be parsed to reconstruct static pages

## Workflow Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | Website URL to scrape |
| `project_name` | string | required | Output folder/Cloudflare project name |
| `scrape_method` | choice | playwright | Method to use |
| `extract_assets` | boolean | true | Download images/CSS/JS |
| `deploy_cloudflare` | boolean | true | Deploy to Cloudflare Pages |
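
The parameter rules above can be expressed as a small validator that enforces the two required fields and applies the documented defaults. This helper is a sketch, not the workflow's own code:

```python
# Sketch: validate workflow parameters and fill in documented defaults.
def normalize_params(params: dict) -> dict:
    for key in ("url", "project_name"):
        if not params.get(key):
            raise ValueError(f"{key} is required")
    return {
        "url": params["url"],
        "project_name": params["project_name"],
        "scrape_method": params.get("scrape_method", "playwright"),
        "extract_assets": params.get("extract_assets", True),
        "deploy_cloudflare": params.get("deploy_cloudflare", True),
    }
```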

## Cost Optimization

| Scenario | Recommended Method |
|----------|-------------------|
| Simple static site | Playwright (FREE) |
| JS-heavy SPA | Playwright → Apify RAG fallback |
| Protected site (Cloudflare) | Firecrawl |
| AI/RAG pipeline | Apify RAG or Crawl4AI |
| Maximum coverage | `all` method |
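
The table above reduces to a simple lookup. The scenario labels here are assumptions introduced for illustration; the mapping mirrors the recommendations:

```python
# Sketch: pick a scraping method from a scenario label, per the table above.
RECOMMENDED_METHOD = {
    "static": "playwright",     # simple static site (free)
    "spa": "playwright",        # JS-heavy SPA, with Apify RAG as fallback
    "protected": "firecrawl",   # anti-bot protected site
    "rag": "apify-rag",         # AI/RAG pipeline (or crawl4ai)
    "max-coverage": "all",
}

def pick_method(scenario: str) -> str:
    # Unknown scenarios fall back to trying every provider.
    return RECOMMENDED_METHOD.get(scenario, "all")
```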

## Security Assessment

Per API_MEGA_LIBRARY guidelines:

| API | Security Score | Recommendation |
|-----|----------------|----------------|
| Apify | 85/100 | ✅ ADOPT |
| Firecrawl | 82/100 | ✅ ADOPT |
| Playwright | 90/100 | ✅ ADOPT (local) |

## Troubleshooting

### Site returns blank page
1. Try `scrape_method: all` to use multiple providers
2. Increase wait time in Playwright
3. Check if site blocks datacenter IPs → use Firecrawl
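
The `scrape_method: all` behavior implied above can be sketched as a fallback chain: try each provider in cost order and keep the first non-blank result. The provider callables here are stand-ins for the real integrations:

```python
# Sketch: multi-provider fallback, treating blank output as failure.
def scrape_with_fallback(url: str, providers: dict) -> tuple[str, str]:
    """providers maps name -> callable(url) -> html; returns (name, html)."""
    for name in ("playwright", "apify-rag", "crawl4ai", "firecrawl"):
        fn = providers.get(name)
        if fn is None:
            continue
        try:
            html = fn(url)
        except Exception:
            continue  # provider errored; try the next one
        if html and html.strip():  # blank page counts as a failure
            return name, html
    raise RuntimeError(f"all providers returned blank output for {url}")
```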

### Assets not downloading
1. Some sites block direct asset requests
2. Use relative paths from original HTML
3. Check for CORS restrictions
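
The "use relative paths" fix above amounts to rewriting absolute asset URLs that point at the original host to local `./assets/` paths. A regex sketch to illustrate the idea (a real implementation would use an HTML parser):

```python
# Sketch: rewrite same-origin absolute asset URLs to local relative paths.
import re
from urllib.parse import urlparse

def localize_asset_urls(html: str, origin: str) -> str:
    host = urlparse(origin).netloc
    pattern = re.compile(
        r'(src|href)="https?://' + re.escape(host)
        + r'/([^"]+\.(?:png|jpe?g|css|js))"')
    return pattern.sub(r'\1="./assets/\2"', html)
```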

### Cloudflare protection detected
1. Use Firecrawl (has anti-bot bypass)
2. Or use Apify with residential proxies

## Related Skills

- `auction-results` - Uses similar scraping for auction data
- `bcpao-scraper` - BCPAO property data extraction
- `youtube-transcript` - Video content extraction

## Changelog

### V2.0 (Dec 2025)
- Added multi-provider support (Playwright, Apify, Firecrawl)
- MCP server integration
- Automatic provider fallback
- Asset downloading
- Cloudflare Pages deployment

### V1.0 (Dec 2025)
- Initial Playwright-only scraper
- Basic HTML/CSS/JS extraction