Back to skills
SkillHub ClubAnalyze Data & AIFull StackData / AI

reshard-c4-data

Guidance for data resharding tasks that involve reorganizing files across directory structures with constraints on file sizes and directory contents. This skill applies when redistributing datasets, splitting large files, or reorganizing data into shards while maintaining constraints like maximum files per directory or maximum file sizes. Use when tasks involve resharding, data partitioning, or directory-constrained file reorganization.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
781
Hot score
99
Updated
March 20, 2026
Overall rating
C4.3
Composite score
4.3
Best-practice grade
B84.8

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-reshard-c4-data

Repository

benchflow-ai/SkillsBench

Skill path: registry/terminal_bench_2.0/letta_skills_batch/terminal_bench_2_0_reshard-c4-data/environment/skills/reshard-c4-data

Guidance for data resharding tasks that involve reorganizing files across directory structures with constraints on file sizes and directory contents. This skill applies when redistributing datasets, splitting large files, or reorganizing data into shards while maintaining constraints like maximum files per directory or maximum file sizes. Use when tasks involve resharding, data partitioning, or directory-constrained file reorganization.

Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install reshard-c4-data into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benchflow-ai/SkillsBench before adding reshard-c4-data to shared team environments
  • Use reshard-c4-data for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: reshard-c4-data
description: Guidance for data resharding tasks that involve reorganizing files across directory structures with constraints on file sizes and directory contents. This skill applies when redistributing datasets, splitting large files, or reorganizing data into shards while maintaining constraints like maximum files per directory or maximum file sizes. Use when tasks involve resharding, data partitioning, or directory-constrained file reorganization.
---

# Reshard C4 Data

## Overview

This skill provides guidance for data resharding tasks where files must be reorganized across a directory structure while respecting constraints such as maximum file size limits and maximum items per directory. These tasks require careful consideration of how constraints apply recursively to all levels of the output structure.

## Critical Concept: Recursive Constraint Application

When a constraint states "maximum N files or folders in each directory," this applies to **ALL directories** in the output structure, including:

- The root output directory
- Any intermediate grouping directories
- The final shard/leaf directories

**Common Mistake:** Interpreting directory constraints as applying only to leaf directories while ignoring the root and intermediate levels.

### Mathematical Validation

Before implementing, perform constraint arithmetic:

1. Calculate the total number of output items (files, shards, or groups)
2. If items exceed the per-directory limit, hierarchical nesting is required
3. Recursively apply this calculation at each level

**Example:** For 9,898 files with a 30-item-per-directory limit:
- Files per shard: 30 → ~330 shards needed
- Shards per group: 30 → ~11 groups needed
- Groups in root: 11 → Compliant (under 30)

This yields a three-level hierarchy: `output/group_XXXX/shard_XXXX/files`

## Approach for Resharding Tasks

### Step 1: Analyze Input Data

- Inventory all input files (count, sizes, directory structure)
- Identify files exceeding size limits that require splitting
- Calculate total storage requirements

### Step 2: Validate Constraint Mathematics

For each constraint, verify compliance at all directory levels:

```
total_files = count(input_files) + count(split_chunks)
shards_needed = ceil(total_files / max_files_per_shard)

if shards_needed > max_items_per_directory:
    groups_needed = ceil(shards_needed / max_items_per_directory)
    # Continue nesting until root directory complies
```

### Step 3: Design Hierarchical Output Structure

Create a directory hierarchy that satisfies constraints at every level:

```
output/
  .metadata.json          # Reconstruction metadata
  group_0000/
    shard_0000/
      file_001.txt
      file_002.txt
      ... (up to max_files_per_shard)
    shard_0001/
    ... (up to max_items_per_directory shards)
  group_0001/
  ... (up to max_items_per_directory groups)
```

### Step 4: Implement File Distribution

- Split oversized files into chunks that comply with size limits
- Distribute files/chunks across shards evenly
- Track original file mappings in metadata for reconstruction
- Use checksums to verify data integrity

### Step 5: Generate Reconstruction Metadata

Include metadata that enables reversing the resharding:
- Original file paths and their shard locations
- Split file chunk mappings
- Checksums for integrity verification

## Verification Strategies

### Constraint Verification Checklist

Execute these checks before declaring success:

1. **Root directory item count:** `ls <output_dir> | wc -l` must be ≤ limit
2. **All intermediate directories:** Recursively verify each directory level
3. **All leaf directories:** Verify shard contents
4. **File size limits:** Check no single file exceeds the maximum
5. **Data integrity:** Verify checksums match originals

### Automated Verification Script Pattern

```python
def verify_directory_constraints(path, max_items):
    """Recursively verify all directories comply with item limit."""
    for root, dirs, files in os.walk(path):
        item_count = len(dirs) + len(files)
        if item_count > max_items:
            return False, f"{root} has {item_count} items (max: {max_items})"
    return True, "All directories compliant"
```

### Common Verification Failures

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| Root directory exceeds limit | Flat shard structure | Add grouping hierarchy |
| Checksums don't match | File corruption during split | Re-implement split logic with verification |
| Missing files in reconstruction | Incomplete metadata | Audit metadata generation |

## Common Pitfalls

### Pitfall 1: Partial Constraint Interpretation

**Mistake:** Applying "max N items per directory" only to leaf directories.

**Prevention:** Explicitly verify every directory level in the output structure.

### Pitfall 2: Ignoring Metadata in Item Counts

**Mistake:** Forgetting that metadata files (e.g., `.metadata.json`) count toward directory limits.

**Prevention:** Include metadata files in constraint calculations.

### Pitfall 3: False Confidence from Partial Testing

**Mistake:** Concluding success after verifying data integrity but not structural constraints.

**Prevention:** Create separate verification steps for each constraint type.

### Pitfall 4: Underestimating Scale

**Mistake:** Testing with small datasets that don't trigger hierarchical nesting requirements.

**Prevention:** Calculate expected output structure mathematically before implementation.

## Testing Recommendations

1. **Unit test constraint logic** with edge cases (exactly at limits, one over, etc.)
2. **Test with representative scale** that triggers all hierarchy levels
3. **Verify round-trip integrity** by reconstructing and comparing checksums
4. **Check all constraint types independently** before integration testing
reshard-c4-data | SkillHub