
data-wrangler

Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations.

Packaged view

This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source appears below.

Stars: 784
Hot score: 99
Updated: March 19, 2026
Overall rating: C (4.6)
Composite score: 4.6
Best-practice grade: S (96.0)

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-data-wrangler

Repository

benchflow-ai/SkillsBench

Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_sqlite-db-truncate/environment/skills/data-wrangler

Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is still a mirrored public skill entry; review the repository before installing it into production workflows.

What it helps with

  • Install data-wrangler into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benchflow-ai/SkillsBench before adding data-wrangler to shared team environments
  • Use data-wrangler for development workflows

Works across

Claude Code, Codex CLI, Gemini CLI, OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: data-wrangler
description: >
  Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases,
  apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally
  write results to files. Use when the user wants to: (1) Clean, filter, or transform data,
  (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.),
  (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results.
  Prefer this over in-context reasoning for datasets with thousands of rows or complex
  transformations.
---

# Data Wrangler

Transform and export data using DuckDB SQL.

## Contents

- [Usage](#usage) - Command syntax and Windows escaping
- [Explore Mode](#explore-mode) - Quick data profiling
- [Query Mode](#query-mode) - Return results to Claude
- [Write Mode](#write-mode) - Export to files
- [Request/Response Format](#requestresponse-format) - JSON structure
- [Source Types](#source-types) - File, database, and cloud sources
- [Transformations](#transformations) - SQL patterns reference
- [Secrets](#secrets) - Secure credential handling

## Usage

**IMPORTANT - Windows Shell Escaping:**

1. Always `cd` to the skill directory first
2. Use **double quotes** for echo with escaped inner quotes (`\"`)
3. Use **forward slashes** in file paths

```bash
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
```

## Explore Mode

**Get schema, statistics, and sample in one call.** Use before writing queries to understand data structure.

```json
{"mode": "explore", "path": "D:/data/sales.csv"}
```

**Response:**
```json
{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001     | Alice    | ... |"
}
```

**Options:**
- `sample_rows`: Number of sample rows (default: 10, max: 100)
- `sources`: For database tables (same as query mode)
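
For example, a sketch of an explore request that uses the `sample_rows` option above (the path is a placeholder):

```json
{"mode": "explore", "path": "D:/data/sales.csv", "sample_rows": 25}
```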

## Query Mode

Return results directly to Claude for analysis.

### Direct File Queries

```json
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
```

### Multi-Source Joins

```json
{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}
```

## Write Mode

Export query results to files. Add an `output` object to write instead of returning data.

### Basic Write

```json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}
```

### Write with Options

```json
{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}
```

### Output Formats

| Format | Options |
|--------|---------|
| `parquet` | `compression` (zstd/snappy/gzip/lz4), `partition_by`, `row_group_size` |
| `csv` | `header` (default: true), `delimiter`, `compression`, `partition_by` |
| `json` | `array` (true=JSON array, false=newline-delimited) |
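
For example, a sketch of a CSV export using the `header` and `delimiter` options from the table above (the query and path are placeholders):

```json
{
  "query": "SELECT * FROM 'data.parquet'",
  "output": {
    "path": "D:/output/data.csv",
    "format": "csv",
    "options": {"header": true, "delimiter": "|"}
  }
}
```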

### Write Response

Response includes verification info - no need for follow-up queries:

```json
{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}
```

### Overwrite Protection

By default, existing files are **not** overwritten. Set `options.overwrite: true` to allow.

## Request/Response Format

### Request

```json
{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}
```

### Query Mode Options

- `max_rows`: Maximum rows to return (default: 200)
- `max_bytes`: Maximum response size (default: 200000)
- `format`: `markdown` (default), `json`, `records`, or `csv`
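
For instance, a sketch of a request that caps the result at 50 rows and returns JSON instead of markdown (file and column names are placeholders):

```json
{
  "query": "SELECT customer, SUM(amount) AS total FROM 'sales.csv' GROUP BY customer",
  "options": {"max_rows": 50, "format": "json"}
}
```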

### Query Mode Response (markdown)

```
| column1 | column2 |
|---|---|
| value1 | value2 |
```

### Query Mode Response (json)

```json
{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}
```

## Source Types

### File (auto-detects CSV, Parquet, JSON, Excel)

```json
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
```

Glob patterns: `{"path": "/logs/**/*.parquet"}`

Custom delimiter: `{"path": "/data/file.csv", "delimiter": "|"}`
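
Putting these together, a sketch of a request that counts rows across a glob of Parquet files registered as a named source (path and alias are placeholders):

```json
{
  "query": "SELECT COUNT(*) AS n FROM logs",
  "sources": [
    {"type": "file", "alias": "logs", "path": "/logs/**/*.parquet"}
  ]
}
```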

### PostgreSQL

```json
{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}
```

### MySQL

```json
{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}
```

### SQLite

```json
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
```

### S3

```json
{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}
```

## Transformations

See [TRANSFORMS.md](TRANSFORMS.md) for advanced patterns including:

- **PIVOT/UNPIVOT** - Reshape data between wide and long formats
- **Sampling** - Random subsets with `USING SAMPLE n ROWS` or `SAMPLE 10%`
- **Dynamic columns** - `EXCLUDE`, `REPLACE`, `COLUMNS('pattern')`
- **Window functions** - Running totals, rankings, moving averages
- **Date/time operations** - Extraction, arithmetic, formatting

### Quick Examples

```sql
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region

-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount

-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)

-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
```
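
The window-function and date/time patterns referenced above are plain DuckDB SQL in the `query` field; for example, a sketch of a per-customer running total (file and column names are placeholders):

```json
{
  "query": "SELECT customer, order_date, amount, SUM(amount) OVER (PARTITION BY customer ORDER BY order_date) AS running_total FROM 'sales.csv'"
}
```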

## Workflow

1. **Inspect schema**: `DESCRIBE SELECT * FROM 'file.csv'`
2. **Preview data**: `SELECT * FROM 'file.csv' LIMIT 5`
3. **Transform**: Apply filters, joins, aggregations
4. **Export** (optional): Add `output` to write results
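
Each step is an ordinary query-mode request; for example, step 1 sent as a full request (the file name is a placeholder):

```json
{"query": "DESCRIBE SELECT * FROM 'file.csv'"}
```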

## Error Handling

- If `error` is non-null: Check column names, verify paths
- If `truncated` is true: Use more aggregation or filters
- If write fails with "exists": Set `options.overwrite: true`
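
For the last case, a sketch of the same write re-sent with overwrite enabled (query and path are placeholders, following the basic write example above):

```json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet",
    "options": {"overwrite": true}
  }
}
```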

## Secrets

Store credentials securely in YAML. See [SECRETS.md](SECRETS.md) for complete documentation.

```json
{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}
```

Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.