data-wrangler
Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases, apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally write results to files. Use when the user wants to: (1) Clean, filter, or transform data, (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.), (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results. Prefer this over in-context reasoning for datasets with thousands of rows or complex transformations.
Packaged view
This page reorganizes the original catalog entry to lead with fit, installability, and workflow context. The original raw source appears below.
Install command
npx @skill-hub/cli install benchflow-ai-skillsbench-data-wrangler
Repository
Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_sqlite-db-truncate/environment/skills/data-wrangler
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: benchflow-ai.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install data-wrangler into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/benchflow-ai/SkillsBench before adding data-wrangler to shared team environments
- Use data-wrangler for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: data-wrangler
description: >
  Transform and export data using DuckDB SQL. Read CSV/Parquet/JSON/Excel/databases,
  apply SQL transformations (joins, aggregations, PIVOT/UNPIVOT, sampling), and optionally
  write results to files. Use when the user wants to: (1) Clean, filter, or transform data,
  (2) Join multiple data sources, (3) Convert between formats (CSV→Parquet, etc.),
  (4) Create partitioned datasets, (5) Sample large datasets, (6) Export query results.
  Prefer this over in-context reasoning for datasets with thousands of rows or complex
  transformations.
---
# Data Wrangler
Transform and export data using DuckDB SQL.
## Contents
- [Usage](#usage) - Command syntax and Windows escaping
- [Explore Mode](#explore-mode) - Quick data profiling
- [Query Mode](#query-mode) - Return results to Claude
- [Write Mode](#write-mode) - Export to files
- [Request/Response Format](#requestresponse-format) - JSON structure
- [Source Types](#source-types) - File, database, and cloud sources
- [Transformations](#transformations) - SQL patterns reference
- [Secrets](#secrets) - Secure credential handling
## Usage
**IMPORTANT - Windows Shell Escaping:**
1. Always `cd` to the skill directory first
2. Use **double quotes** for echo with escaped inner quotes (`\"`)
3. Use **forward slashes** in file paths
```bash
cd "<skill_directory>" && echo "{\"query\": \"SELECT * FROM 'D:/path/to/file.csv'\"}" | uv run scripts/query_duckdb.py
```
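Hand-escaping the inner quotes is error-prone on any platform. A minimal stdlib-Python sketch (an assumption, not part of the skill itself) that builds the request with `json.dumps` so all quoting is handled automatically before piping it to the script:

```python
import json

# Build the request as a plain dict and let json.dumps handle quoting,
# instead of hand-escaping inner quotes in an echo string.
request = {"query": "SELECT * FROM 'D:/path/to/file.csv' LIMIT 10"}
payload = json.dumps(request)

# The resulting one-line payload can then be piped to the skill script,
# e.g. from the skill directory:
#   echo <payload> | uv run scripts/query_duckdb.py
print(payload)
```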
## Explore Mode
**Get schema, statistics, and sample in one call.** Use before writing queries to understand data structure.
```json
{"mode": "explore", "path": "D:/data/sales.csv"}
```
**Response:**
```json
{
  "file": "D:/data/sales.csv",
  "format": "csv",
  "row_count": 15234,
  "columns": [
    {"name": "order_id", "type": "BIGINT", "null_count": 0, "null_percent": 0.0},
    {"name": "customer", "type": "VARCHAR", "null_count": 45, "null_percent": 0.3}
  ],
  "sample": "| order_id | customer | ... |\\n|----------|----------|-----|\\n| 1001 | Alice | ... |"
}
```
**Options:**
- `sample_rows`: Number of sample rows (default: 10, max: 100)
- `sources`: For database tables (same as query mode)
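For example, to pull a larger sample while exploring:

```json
{"mode": "explore", "path": "D:/data/sales.csv", "sample_rows": 25}
```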
## Query Mode
Return results directly to Claude for analysis.
### Direct File Queries
```json
{"query": "SELECT * FROM 'data.csv' LIMIT 10"}
```
### Multi-Source Joins
```json
{
  "query": "SELECT s.*, p.category FROM sales s JOIN products p ON s.product_id = p.id",
  "sources": [
    {"type": "file", "alias": "sales", "path": "/data/sales.parquet"},
    {"type": "file", "alias": "products", "path": "/data/products.csv"}
  ]
}
```
## Write Mode
Export query results to files. Add an `output` object to write instead of returning data.
### Basic Write
```json
{
  "query": "SELECT * FROM 'raw.csv' WHERE status = 'active'",
  "output": {
    "path": "D:/output/filtered.parquet",
    "format": "parquet"
  }
}
```
### Write with Options
```json
{
  "query": "SELECT *, YEAR(date) as year, MONTH(date) as month FROM 'events.csv'",
  "output": {
    "path": "D:/output/events/",
    "format": "parquet",
    "options": {
      "compression": "zstd",
      "partition_by": ["year", "month"],
      "overwrite": true
    }
  }
}
```
### Output Formats
| Format | Options |
|--------|---------|
| `parquet` | `compression` (zstd/snappy/gzip/lz4), `partition_by`, `row_group_size` |
| `csv` | `header` (default: true), `delimiter`, `compression`, `partition_by` |
| `json` | `array` (true=JSON array, false=newline-delimited) |
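As a sketch of the `json` options above, a request that exports newline-delimited JSON (one object per line, `array: false`; the `.jsonl` extension is just a convention chosen here):

```json
{
  "query": "SELECT order_id, customer FROM 'raw.csv'",
  "output": {
    "path": "D:/output/orders.jsonl",
    "format": "json",
    "options": {"array": false}
  }
}
```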
### Write Response
The response includes verification details, so no follow-up query is needed:
```json
{
  "success": true,
  "output_path": "D:/output/events/",
  "format": "parquet",
  "rows_written": 15234,
  "files_created": ["D:/output/events/year=2023/data_0.parquet", "..."],
  "total_size_bytes": 5678901,
  "duration_ms": 1234
}
```
### Overwrite Protection
By default, existing files are **not** overwritten. Set `options.overwrite: true` to allow.
## Request/Response Format
### Request
```json
{
  "query": "SQL statement",
  "sources": [...],
  "output": {"path": "...", "format": "..."},
  "options": {"max_rows": 200, "format": "markdown"},
  "secrets_file": "path/to/secrets.yaml"
}
```
### Query Mode Options
- `max_rows`: Maximum rows to return (default: 200)
- `max_bytes`: Maximum response size (default: 200000)
- `format`: `markdown` (default), `json`, `records`, or `csv`
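Combining these, a query-mode request that returns at most 50 rows as structured JSON instead of markdown:

```json
{
  "query": "SELECT customer, COUNT(*) AS orders FROM 'data.csv' GROUP BY customer",
  "options": {"max_rows": 50, "format": "json"}
}
```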
### Query Mode Response (markdown)
```
| column1 | column2 |
|---|---|
| value1 | value2 |
```
### Query Mode Response (json)
```json
{
  "schema": [{"name": "col1", "type": "INTEGER"}],
  "rows": [[1, "value"]],
  "truncated": false,
  "warnings": [],
  "error": null
}
```
## Source Types
### File (auto-detects CSV, Parquet, JSON, Excel)
```json
{"type": "file", "alias": "data", "path": "/path/to/file.csv"}
```
Glob patterns: `{"path": "/logs/**/*.parquet"}`
Custom delimiter: `{"path": "/data/file.csv", "delimiter": "|"}`
### PostgreSQL
```json
{
  "type": "postgres", "alias": "users",
  "host": "host", "port": 5432, "database": "db",
  "user": "user", "password": "pass",
  "schema": "public", "table": "users"
}
```
### MySQL
```json
{
  "type": "mysql", "alias": "orders",
  "host": "host", "port": 3306, "database": "db",
  "user": "user", "password": "pass", "table": "orders"
}
```
### SQLite
```json
{"type": "sqlite", "alias": "data", "path": "/path/to/db.sqlite", "table": "tablename"}
```
### S3
```json
{
  "type": "s3", "alias": "logs",
  "url": "s3://bucket/path/*.parquet",
  "aws_region": "us-east-1",
  "aws_access_key_id": "...", "aws_secret_access_key": "..."
}
```
## Transformations
See [TRANSFORMS.md](TRANSFORMS.md) for advanced patterns including:
- **PIVOT/UNPIVOT** - Reshape data between wide and long formats
- **Sampling** - Random subsets with `USING SAMPLE n ROWS` or `SAMPLE 10%`
- **Dynamic columns** - `EXCLUDE`, `REPLACE`, `COLUMNS('pattern')`
- **Window functions** - Running totals, rankings, moving averages
- **Date/time operations** - Extraction, arithmetic, formatting
### Quick Examples
```sql
-- PIVOT: Convert rows to columns
PIVOT sales ON quarter USING SUM(revenue) GROUP BY region
-- UNPIVOT: Convert columns to rows
UNPIVOT data ON q1, q2, q3, q4 INTO NAME quarter VALUE amount
-- Sampling: Random 10% with reproducible seed
SELECT * FROM large_table USING SAMPLE 10% REPEATABLE(42)
-- Dynamic columns: Exclude sensitive, transform email
SELECT * EXCLUDE (ssn) REPLACE (LOWER(email) AS email) FROM users
```
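The window-function and date/time patterns listed above can be sketched the same way (illustrative DuckDB SQL; the table and column names are hypothetical):

```sql
-- Window function: running revenue total per region, ordered by date
SELECT region, order_date, revenue,
       SUM(revenue) OVER (PARTITION BY region ORDER BY order_date) AS running_total
FROM 'sales.csv';

-- Date/time: bucket rows by month
SELECT DATE_TRUNC('month', order_date) AS month, COUNT(*) AS orders
FROM 'sales.csv'
GROUP BY 1 ORDER BY 1;
```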
## Workflow
1. **Inspect schema**: `DESCRIBE SELECT * FROM 'file.csv'`
2. **Preview data**: `SELECT * FROM 'file.csv' LIMIT 5`
3. **Transform**: Apply filters, joins, aggregations
4. **Export** (optional): Add `output` to write results
## Error Handling
- If `error` is non-null: Check column names, verify paths
- If `truncated` is true: Use more aggregation or filters
- If write fails with "exists": Set `options.overwrite: true`
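For the `truncated` case, one option is narrowing the result with aggregation rather than raising `max_rows`; for example (column names hypothetical):

```json
{
  "query": "SELECT status, COUNT(*) AS n FROM 'events.csv' GROUP BY status ORDER BY n DESC"
}
```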
## Secrets
Store credentials securely in YAML. See [SECRETS.md](SECRETS.md) for complete documentation.
```json
{
  "query": "SELECT * FROM customers LIMIT 10",
  "secrets_file": "D:/path/to/secrets.yaml",
  "sources": [{
    "type": "postgres", "alias": "customers",
    "secret": "my_postgres", "table": "customers"
  }]
}
```
Supported: PostgreSQL, MySQL, S3, GCS, Azure, R2, HTTP, HuggingFace, Iceberg, DuckLake.
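The actual `secrets.yaml` schema is defined in SECRETS.md. Purely as an illustrative sketch (the key names below are assumptions, not the documented format), the file pairs a secret name such as `my_postgres` with its connection fields:

```yaml
# Hypothetical sketch only - see SECRETS.md for the real schema
my_postgres:
  host: db.example.com
  port: 5432
  database: prod
  user: readonly
  password: "..."
```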