prepare-dataset
Imported from https://github.com/mvillmow/ProjectOdyssey.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Stars: 14
Hot score: 86
Updated: March 20, 2026
Overall rating: C (4.0)
Composite score: 4.0
Best-practice grade: S (96.0)
Install command: npx @skill-hub/cli install mvillmow-projectodyssey-prepare-dataset
Repository: mvillmow/ProjectOdyssey
Skill path: .claude/skills/tier-2/prepare-dataset
Best for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: mvillmow.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install prepare-dataset into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/mvillmow/ProjectOdyssey before adding prepare-dataset to shared team environments
- Use prepare-dataset for development workflows
Works across
Claude Code, Codex CLI, Gemini CLI, OpenCode
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: prepare-dataset
description: "Process and validate datasets for training. Use when setting up data pipelines."
mcp_fallback: none
category: ml
tier: 2
---
# Prepare Dataset
Load, preprocess, and validate datasets for machine learning model training, including normalization and augmentation.
## When to Use
- Setting up data pipelines for training
- Normalizing and cleaning raw data
- Splitting into train/validation/test sets
- Applying data augmentation
## Quick Reference
```python
# Dataset preparation pipeline (skeleton)
from typing import Tuple

from numpy import ndarray


class DatasetLoader:
    def load(self, path: str) -> Tuple[ndarray, ndarray]:
        # Load raw features and labels from disk (CSV, HDF5, NumPy, ...)
        raise NotImplementedError

    def normalize(self, data: ndarray) -> ndarray:
        # Scale to [0, 1]; swap in standardization (zero mean, unit variance) if preferred
        lo, hi = data.min(), data.max()
        return (data - lo) / (hi - lo) if hi > lo else data - lo

    def split(self, data: ndarray, ratios: Tuple[float, float, float]):
        # Split into train/val/test along the first axis
        n_train = int(len(data) * ratios[0])
        n_val = n_train + int(len(data) * ratios[1])
        return data[:n_train], data[n_train:n_val], data[n_val:]

    def augment(self, data: ndarray) -> ndarray:
        # Apply transformations (rotation, flip, noise) if needed
        return data
```
## Workflow
1. **Load raw data**: Read dataset from file (CSV, HDF5, NumPy)
2. **Validate data**: Check shape, dtype, missing values
3. **Preprocess**: Normalize, standardize, encode categorical features
4. **Split sets**: Create train/validation/test splits
5. **Augment data**: Apply transformations if needed (rotation, flip, etc.); a usage sketch of the whole pipeline follows this list
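
A minimal end-to-end sketch of these steps, assuming the `DatasetLoader` skeleton from the Quick Reference above; the file names and split ratios are illustrative placeholders, not part of the skill:

```python
import numpy as np

loader = DatasetLoader()

# 1. Load raw data (np.load used directly here since DatasetLoader.load is a stub)
features = np.load("features.npy")
labels = np.load("labels.npy")

# 2. Validate data: shape, dtype, missing values
assert features.ndim == 2, f"unexpected shape {features.shape}"
assert not np.isnan(features).any(), "missing values in features"

# 3. Preprocess: scale features to [0, 1]
features = loader.normalize(features)

# 4. Split into train/validation/test (80/10/10)
train, val, test = loader.split(features, (0.8, 0.1, 0.1))

# 5. Augment the training set only, if needed
train = loader.augment(train)
```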
## Output Format
Dataset preparation report (a sketch of one possible structure follows this list):
- Raw data shape and statistics
- Data validation results (missing values, outliers)
- Preprocessing applied (normalization, encoding)
- Train/val/test split sizes
- Final dataset shape and statistics
- Augmentation transformations applied
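
One way to assemble such a report, sketched as a plain dictionary; the helper and field names are illustrative and not defined by this skill:

```python
import numpy as np
from numpy import ndarray


def build_report(raw: ndarray, train: ndarray, val: ndarray, test: ndarray,
                 preprocessing: list[str], augmentations: list[str]) -> dict:
    # Keys mirror the report sections listed above
    return {
        "raw_shape": raw.shape,
        "raw_stats": {"mean": float(np.nanmean(raw)), "std": float(np.nanstd(raw))},
        "missing_values": int(np.isnan(raw).sum()),
        "preprocessing": preprocessing,      # e.g. ["min-max normalization"]
        "split_sizes": {"train": len(train), "val": len(val), "test": len(test)},
        "final_shape": train.shape,
        "augmentations": augmentations,      # e.g. ["horizontal flip"]
    }
```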
## References
- See `extract-hyperparameters` skill for data preprocessing config
- See `evaluate-model` skill for test set evaluation
- See `/notes/review/mojo-ml-patterns.md` for Mojo data loading