bigquery-ethereum-data-acquisition
Workflow for acquiring historical Ethereum blockchain data using Google BigQuery free tier. Empirically validated for cost estimation, streaming downloads, and DuckDB integration. Use when planning bulk historical data acquisition or comparing data source options for blockchain network metrics.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install terrylica-gapless-network-data-bigquery-ethereum-data-acquisition
Repository
Skill path: .claude/skills/bigquery-ethereum-data-acquisition
Best for
Primary workflow: Research & Ops.
Technical facets: Full Stack, Data / AI, Integration.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: terrylica.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install bigquery-ethereum-data-acquisition into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/terrylica/gapless-network-data before adding bigquery-ethereum-data-acquisition to shared team environments
- Use bigquery-ethereum-data-acquisition when planning bulk historical Ethereum data acquisition or comparing data source options for blockchain network metrics
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: bigquery-ethereum-data-acquisition
description: Workflow for acquiring historical Ethereum blockchain data using Google BigQuery free tier. Empirically validated for cost estimation, streaming downloads, and DuckDB integration. Use when planning bulk historical data acquisition or comparing data source options for blockchain network metrics.
---

# BigQuery Ethereum Data Acquisition

## Overview

Workflow for acquiring 5 years of Ethereum blockchain data (2020-2025, ~12.44M blocks) using Google BigQuery's public dataset within free tier limits. Includes column selection analysis for ML/time-series forecasting, cost optimization, and empirically validated download workflows.

**Status**: ✅ Empirically validated (v0.2.0, 2025-11-07)

## When to Use This Skill

Use when:

- Evaluating BigQuery as a data source for Ethereum historical data
- Planning bulk historical blockchain data acquisition
- Optimizing column selection for feature engineering
- Calculating query costs and free tier utilization
- Comparing BigQuery vs RPC polling approaches
- Streaming downloads without BigQuery storage

## Core Workflow

This skill follows a 5-step workflow for acquiring Ethereum data from BigQuery:

| Step | Purpose | Output | Key Metric |
| --- | --- | --- | --- |
| **1. Free Tier Limits** | Understand query/storage limits | Limits documented | 1 TB query, 10 GB storage |
| **2. Column Selection** | Optimize for ML/time-series | 11 columns chosen | 0.97 GB (97% savings) |
| **3. Cost Validation** | Dry-run query cost | Cost estimate | 0.1% of free tier |
| **4. Stream Download** | Parquet download (no storage) | .parquet file | 62 bytes/row |
| **5. DuckDB Import** | Load for analysis | DuckDB database | <100ms query time |

**Detailed workflow**: See `references/workflow-steps.md` for complete step-by-step guide with SQL queries, bash commands, and validated results for each step.

**Quick start**: Run `uv run scripts/test_bigquery_cost.py` to validate cost, then `uv run scripts/download_bigquery_to_parquet.py <start> <end> <output>` to download.
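The validated scripts are not reproduced here, but a minimal sketch of steps 3-4 with the `google-cloud-bigquery` client looks roughly like the following. The column list and date range are illustrative placeholders rather than the validated 11-column selection (see `references/cost-analysis.md`); only the public `bigquery-public-data.crypto_ethereum.blocks` table name comes from the dataset itself.

```python
"""Sketch of steps 3-4: dry-run cost check, then streaming download to Parquet.

Assumes `gcloud auth application-default login` has been run and that
google-cloud-bigquery, pandas, pyarrow, and db-dtypes are installed.
Column list and date range are illustrative, not the validated selection.
"""
import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import bigquery

SQL = """
SELECT number, timestamp, gas_used, gas_limit, base_fee_per_gas,
       transaction_count, size
FROM `bigquery-public-data.crypto_ethereum.blocks`
WHERE timestamp >= TIMESTAMP("2020-01-01")
  AND timestamp <  TIMESTAMP("2025-01-01")
ORDER BY number
"""

client = bigquery.Client()

# Step 3: dry run. BigQuery reports bytes scanned without executing the
# query, so free tier usage can be checked before committing to a download.
dry = client.query(
    SQL, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
scanned_gb = dry.total_bytes_processed / 1e9
print(f"Estimated scan: {scanned_gb:.2f} GB of the 1,000 GB/month free tier")

# Step 4: stream result pages into a local Parquet file; nothing is stored
# in BigQuery itself.
writer = None
for chunk in client.query(SQL).result().to_dataframe_iterable():
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("ethereum_blocks.parquet", table.schema)
    writer.write_table(table)
if writer is not None:
    writer.close()
```

The validated implementations live in `scripts/test_bigquery_cost.py` and `scripts/download_bigquery_to_parquet.py`; this sketch only shows the general shape of the pattern.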
## Cost Analysis

**Optimized selection**: 11 columns = 0.97 GB (0.1% of free tier, 97% cost savings vs all 23 columns)

**Key finding**: BigQuery is 624x faster than RPC polling (<1 hour vs 26 days for 12.44M blocks).

**Full analysis**: See `references/cost-analysis.md` for detailed cost comparison, column selection rationale, and RPC provider comparison.

## Prerequisites

**One-time setup**: gcloud auth, Python dependencies (google-cloud-bigquery, pandas, pyarrow, db-dtypes)

**Setup guide**: See `references/setup-guide.md` for complete authentication setup, dependency installation, and verification commands.

## Scripts

Validated scripts (v0.2.0):

- `test_bigquery_cost.py` - Dry-run cost estimation (0.97 GB for 12.44M blocks)
- `download_bigquery_to_parquet.py` - Streaming Parquet download (62 bytes/row validated)

**Templates and usage**: See `scripts/README.md` for complete usage examples, dependencies, and validated results.

## References

### Workflow Documentation

- `references/workflow-steps.md` - Complete 5-step workflow with SQL queries, bash commands, and validated results
- `references/cost-analysis.md` - Detailed cost comparison, column selection rationale, RPC provider comparison
- `references/setup-guide.md` - Authentication setup, dependencies, verification commands

### Research Documents

- `references/bigquery_cost_comparison.md` - Empirical cost analysis (6 vs 11 vs 23 columns)
- `references/ethereum_columns_ml_evaluation.md` - Column-by-column ML value analysis
- `references/bigquery_complete_ethereum_data.md` - Complete dataset catalog (11 tables)
- `references/bigquery_cost_estimate.md` - Free tier limits and methodology
- `references/littleblack-hardware-report.md` - Local vs cloud hardware comparison

### Scripts & Validation

- `scripts/README.md` - Complete script usage guide with validated results
- `VALIDATION_STATUS.md` - Empirical test results, testing methodology, dependencies validated

## Verification After Acquisition

**Important**: This skill covers data acquisition from BigQuery (downloading historical Ethereum blocks), but does NOT verify the data actually landed in ClickHouse.

After completing BigQuery download:

- Use the **historical-backfill-execution** skill
- Run `scripts/clickhouse/verify_blocks.py` to verify database state
- Confirm expected block count (~23.8M blocks for 2015-2025 backfill)

**Common workflow**:

1. Download from BigQuery using this skill (Step 4 above)
2. Insert to ClickHouse via `chunked_backfill.sh`
3. **Verify ClickHouse state** using historical-backfill-execution skill
4. Check yearly breakdown to ensure complete coverage

See `historical-backfill-execution` skill for database verification and troubleshooting missing data.

## Related Skills

- `historical-backfill-execution` - ClickHouse database verification and backfill operations
- `blockchain-rpc-provider-research` - RPC rate limit comparison and provider evaluation
- `blockchain-data-collection-validation` - Empirical validation workflow for data pipelines
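Before handing off to the ClickHouse verification described above, the downloaded Parquet file can also be sanity-checked locally with DuckDB (step 5 of the workflow). This is a minimal sketch: the file and table names are placeholders, and the authoritative database verification remains the `historical-backfill-execution` skill.

```python
"""Step 5 sketch: load the downloaded Parquet into DuckDB and check coverage.

File and table names are illustrative placeholders; ClickHouse verification
is handled by the historical-backfill-execution skill, not this snippet.
"""
import duckdb

con = duckdb.connect("ethereum.duckdb")

# Import the Parquet file into a persistent DuckDB table for fast local queries.
con.execute(
    "CREATE OR REPLACE TABLE blocks AS "
    "SELECT * FROM read_parquet('ethereum_blocks.parquet')"
)

total, first, last = con.execute(
    "SELECT COUNT(*), MIN(number), MAX(number) FROM blocks"
).fetchone()
print(f"{total:,} blocks, range {first:,}..{last:,}")

# A yearly breakdown makes gaps in coverage easy to spot before backfilling.
for year, count in con.execute(
    "SELECT EXTRACT(year FROM timestamp) AS y, COUNT(*) "
    "FROM blocks GROUP BY y ORDER BY y"
).fetchall():
    print(f"{int(year)}: {count:,} blocks")
```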