prepare-dataset
Imported from https://github.com/mvillmow/ProjectOdyssey.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Stars: 14
Hot score: 86
Updated: March 20, 2026
Overall rating: C (4.0)
Composite score: 4.0
Best-practice grade: S (96.0)
Install command: npx @skill-hub/cli install mvillmow-projectodyssey-prepare-dataset
Repository: mvillmow/ProjectOdyssey
Skill path: .claude/skills/tier-2/prepare-dataset
Best for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: mvillmow.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install prepare-dataset into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/mvillmow/ProjectOdyssey before adding prepare-dataset to shared team environments
- Use prepare-dataset for development workflows
Works across
Claude Code, Codex CLI, Gemini CLI, OpenCode
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: prepare-dataset
description: "Process and validate datasets for training. Use when setting up data pipelines."
mcp_fallback: none
category: ml
tier: 2
---
# Prepare Dataset
Load, preprocess, and validate datasets for machine learning model training, including normalization and augmentation.
## When to Use
- Setting up data pipelines for training
- Normalizing and cleaning raw data
- Splitting into train/validation/test sets
- Applying data augmentation
## Quick Reference
```python
# Dataset preparation pipeline (skeleton)
from typing import Tuple

from numpy import ndarray


class DatasetLoader:
    def load(self, path: str) -> Tuple[ndarray, ndarray]:
        # Load raw features and labels from disk (CSV, HDF5, NumPy, ...)
        raise NotImplementedError

    def normalize(self, data: ndarray) -> ndarray:
        # Scale to [0, 1]; swap in standardization (zero mean, unit variance) if preferred
        lo, hi = data.min(), data.max()
        return (data - lo) / (hi - lo) if hi > lo else data - lo

    def split(self, data: ndarray, ratios: Tuple[float, float, float]):
        # Split into train/val/test along the first axis
        n_train = int(len(data) * ratios[0])
        n_val = n_train + int(len(data) * ratios[1])
        return data[:n_train], data[n_train:n_val], data[n_val:]

    def augment(self, data: ndarray) -> ndarray:
        # Apply transformations (rotation, flip, noise) if needed
        return data
```
## Workflow
1. **Load raw data**: Read dataset from file (CSV, HDF5, NumPy)
2. **Validate data**: Check shape, dtype, missing values
3. **Preprocess**: Normalize, standardize, encode categorical features
4. **Split sets**: Create train/validation/test splits
5. **Augment data**: Apply transformations if needed (rotation, flip, etc.); a usage sketch of the whole pipeline follows this list
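
A minimal end-to-end sketch of these steps, assuming the `DatasetLoader` skeleton from the Quick Reference above; the file names and split ratios are illustrative placeholders, not part of the skill:

```python
import numpy as np

loader = DatasetLoader()

# 1. Load raw data (np.load used directly here since DatasetLoader.load is a stub)
features = np.load("features.npy")
labels = np.load("labels.npy")

# 2. Validate data: shape, dtype, missing values
assert features.ndim == 2, f"unexpected shape {features.shape}"
assert not np.isnan(features).any(), "missing values in features"

# 3. Preprocess: scale features to [0, 1]
features = loader.normalize(features)

# 4. Split into train/validation/test (80/10/10)
train, val, test = loader.split(features, (0.8, 0.1, 0.1))

# 5. Augment the training set only, if needed
train = loader.augment(train)
```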
## Output Format
Dataset preparation report (a sketch of one possible structure follows this list):
- Raw data shape and statistics
- Data validation results (missing values, outliers)
- Preprocessing applied (normalization, encoding)
- Train/val/test split sizes
- Final dataset shape and statistics
- Augmentation transformations applied
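
One way to assemble such a report, sketched as a plain dictionary; the helper and field names are illustrative and not defined by this skill:

```python
import numpy as np
from numpy import ndarray


def build_report(raw: ndarray, train: ndarray, val: ndarray, test: ndarray,
                 preprocessing: list[str], augmentations: list[str]) -> dict:
    # Keys mirror the report sections listed above
    return {
        "raw_shape": raw.shape,
        "raw_stats": {"mean": float(np.nanmean(raw)), "std": float(np.nanstd(raw))},
        "missing_values": int(np.isnan(raw).sum()),
        "preprocessing": preprocessing,      # e.g. ["min-max normalization"]
        "split_sizes": {"train": len(train), "val": len(val), "test": len(test)},
        "final_shape": train.shape,
        "augmentations": augmentations,      # e.g. ["horizontal flip"]
    }
```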
## References
- See `extract-hyperparameters` skill for data preprocessing config
- See `evaluate-model` skill for test set evaluation
- See `/notes/review/mojo-ml-patterns.md` for Mojo data loading