website-to-vite-scraper
A multi-provider website scraper that converts dynamic sites to static deployments. Combines Playwright, Apify, and Firecrawl to handle different site types, with built-in asset downloading and Cloudflare Pages deployment. Useful for cloning, reverse-engineering, or archiving websites.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install breverdbidder-life-os-website-to-vite-scraper
Repository
Skill path: skills/website-to-vite-scraper
A multi-provider website scraper that converts dynamic sites to static deployments. Combines Playwright, Apify, and Firecrawl to handle different site types, with built-in asset downloading and Cloudflare Pages deployment. Useful for cloning, reverse-engineering, or archiving websites.
Open repositoryBest for
Primary workflow: Write Technical Docs.
Technical facets: Full Stack, DevOps, Tech Writer, Testing.
Target audience: Developers needing to clone websites for analysis, create static versions of dynamic sites, or archive web content.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: breverdbidder.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install website-to-vite-scraper into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/breverdbidder/life-os before adding website-to-vite-scraper to shared team environments
- Use website-to-vite-scraper for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: website-to-vite-scraper
description: Multi-provider website scraper that converts any website (including CSR/SPA) to deployable static sites. Uses Playwright, Apify RAG Browser, Crawl4AI, and Firecrawl for comprehensive scraping. Triggers on requests to clone, reverse-engineer, or convert websites.
version: "2.0"
---
# Website-to-Vite Scraper V2
Multi-provider website scraper with AI-powered extraction for any website type.
## Scraping Methods
| Method | Best For | Anti-Bot | JS Rendering | Cost |
|--------|----------|----------|--------------|------|
| **Playwright** | General sites, Next.js/React apps | ❌ | ✅ Full | FREE |
| **Apify RAG Browser** | LLM/RAG-optimized content | ✅ | ✅ Adaptive | Credits |
| **Crawl4AI** | AI training data, clean extraction | ✅ | ✅ | Credits |
| **Firecrawl** | Protected sites, anti-bot bypass | ✅✅ | ✅ | $16/mo |
## Quick Start
### GitHub Actions (Recommended)
```bash
# Go to: Actions → Website Scraper V2 → Run workflow
# Options:
# - URL: https://www.reventure.app/
# - Project name: reventure-clone
# - Method: all (tries all providers)
# - Deploy: true
```
### API MEGA LIBRARY Integration
The following APIs from our library enhance this scraper:
| API | Purpose | Status |
|-----|---------|--------|
| `APIFY_API_TOKEN` | RAG Browser, Crawl4AI, Web Scraper | ✅ Configured |
| `FIRECRAWL_API_KEY` | Anti-bot bypass, stealth mode | ✅ Configured |
| `BROWSERLESS_API_KEY` | Alternative headless browser | 🔄 Available |
### MCP Server Integration
Connect Claude Desktop/Cursor to Apify MCP for AI-powered scraping:
```json
{
"mcpServers": {
"apify": {
"command": "npx",
"args": ["@apify/actors-mcp-server"],
"env": {
"APIFY_TOKEN": "your-apify-api-token"
}
}
}
}
```
Or use hosted: `https://mcp.apify.com?token=YOUR_TOKEN`
## Apify Actors Used
### apify/rag-web-browser
- **Purpose:** LLM-optimized web content extraction
- **Output:** Markdown, HTML, text
- **Features:**
- Playwright adaptive (handles JS)
- Clean content extraction
- Link following
- Metadata extraction
### raizen/ai-web-scraper (Crawl4AI)
- **Purpose:** AI training data collection
- **Output:** Cleaned markdown, structured links
- **Features:**
- Excludes boilerplate (headers, footers, nav)
- Word count thresholding
- External link filtering
### Firecrawl
- **Purpose:** Anti-bot protected sites
- **Output:** Markdown, HTML, screenshots
- **Features:**
- Anti-detection technology
- JavaScript rendering
- Main content extraction
- 5-second wait for dynamic content
## Output Structure
```
project-name/
├── dist/
│ ├── index.html # Best merged HTML
│ ├── screenshot.png # Full page capture
│ ├── meta.json # Scrape metadata
│ └── assets/
│ ├── images/ # Downloaded images
│ ├── css/ # Stylesheets
│ └── js/ # Scripts
└── results/
├── playwright/ # Raw Playwright output
├── apify-rag/ # RAG Browser output
├── crawl4ai/ # Crawl4AI output
└── firecrawl/ # Firecrawl output
```
## Handling CSR/SPA Sites
Sites like Next.js, React, Vue that render client-side require JavaScript execution:
1. **Playwright** waits for `networkidle` + 5 seconds
2. **Apify RAG** uses adaptive crawler (Playwright when needed)
3. **Firecrawl** has built-in JS rendering
For `__NEXT_DATA__` extraction (Next.js sites):
- Playwright automatically extracts and saves to `next_data.json`
- Can be parsed to reconstruct static pages
## Workflow Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | Website URL to scrape |
| `project_name` | string | required | Output folder/Cloudflare project name |
| `scrape_method` | choice | playwright | Method to use |
| `extract_assets` | boolean | true | Download images/CSS/JS |
| `deploy_cloudflare` | boolean | true | Deploy to Cloudflare Pages |
## Cost Optimization
| Scenario | Recommended Method |
|----------|-------------------|
| Simple static site | Playwright (FREE) |
| JS-heavy SPA | Playwright → Apify RAG fallback |
| Protected site (Cloudflare) | Firecrawl |
| AI/RAG pipeline | Apify RAG or Crawl4AI |
| Maximum coverage | `all` method |
## Security Assessment
Per API_MEGA_LIBRARY guidelines:
| API | Security Score | Recommendation |
|-----|----------------|----------------|
| Apify | 85/100 | ✅ ADOPT |
| Firecrawl | 82/100 | ✅ ADOPT |
| Playwright | 90/100 | ✅ ADOPT (local) |
## Troubleshooting
### Site returns blank page
1. Try `scrape_method: all` to use multiple providers
2. Increase wait time in Playwright
3. Check if site blocks datacenter IPs → use Firecrawl
### Assets not downloading
1. Some sites block direct asset requests
2. Use relative paths from original HTML
3. Check for CORS restrictions
### Cloudflare protection detected
1. Use Firecrawl (has anti-bot bypass)
2. Or use Apify with residential proxies
## Related Skills
- `auction-results` - Uses similar scraping for auction data
- `bcpao-scraper` - BCPAO property data extraction
- `youtube-transcript` - Video content extraction
## Changelog
### V2.0 (Dec 2025)
- Added multi-provider support (Playwright, Apify, Firecrawl)
- MCP server integration
- Automatic provider fallback
- Asset downloading
- Cloudflare Pages deployment
### V1.0 (Dec 2025)
- Initial Playwright-only scraper
- Basic HTML/CSS/JS extraction