ml-pipeline
Complete machine learning pipeline for trading: feature engineering, AutoML, deep learning, and financial RL. Use for automated parameter sweeps, feature creation, model training, and anti-leakage validation.
Packaged view
This page reorganizes the original catalog entry to put fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install openclaw-skills-ml-pipeline
Repository
Skill path: skills/ahuserious/ml-pipeline
Open repository
Best for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: openclaw.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install ml-pipeline into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/openclaw/skills before adding ml-pipeline to shared team environments
- Use ml-pipeline for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: ml-pipeline
description: >
  Complete machine learning pipeline for trading: feature engineering, AutoML, deep learning, and financial RL.
  Use for automated parameter sweeps, feature creation, model training, and anti-leakage validation.
version: "2.0.0"
allowed-tools: Read, Write, Edit, Bash, Glob, Grep
metadata:
  consolidates:
    - ml-feature-engineering
    - deep-learning-optimizer-5
    - pytorch-lightning-2
    - scikit-learn-ml-framework
    - automl-pipeline-builder-2
    - ml-feature-engineering-helper
    - ml-fundamentals
    - machine-learning-feature-engineering-toolkit
---
# ML Pipeline
Unified skill for the complete ML pipeline within a quant trading research system.
Consolidates eight prior skills into a single authoritative reference covering
the full lifecycle: data validation, feature creation, selection,
transformation, anti-leakage checks, pipeline automation, deep learning optimization, and deployment.
---
## 1. When to Use
Activate this skill when the task involves any of the following:
- Creating, selecting, or transforming features for an ML-driven strategy.
- Auditing an existing feature pipeline for data leakage or overfitting risk.
- Automating an end-to-end ML pipeline (data prep through model export).
- Evaluating feature importance, scaling, encoding, or interaction effects.
- Integrating features with a feature store (Feast, Tecton, custom Parquet store).
- Explaining core ML concepts (bias-variance, cross-validation, regularisation)
in the context of feature engineering decisions.
---
## 2. Inputs to Gather
Before starting work, collect or confirm:
| Input | Details |
|-------|---------|
| **Objective** | Target metric (Sharpe, accuracy, RMSE ...), constraints, time horizon. |
| **Data** | Symbols / instruments, timeframe, bar type, sampling frequency, data sources. |
| **Leakage risks** | Point-in-time concerns, survivorship bias, look-ahead in labels or features. |
| **Compute budget** | CPU/GPU limits, wall-clock budget for AutoML search. |
| **Latency** | Online vs. offline inference, acceptable prediction latency. |
| **Interpretability** | Regulatory or research need for explainable features / models. |
| **Deployment target** | Where the model will run (notebook, backtest harness, live engine). |
---
## 3. Feature Creation Patterns
### 3.1 Numerical Features
- **Interaction terms**: `price * volume`, `high / low`, `close - open`.
- **Rolling statistics**: mean, std, skew, kurtosis over configurable windows.
- **Polynomial / log transforms**: `log(volume + 1)`, `spread^2`.
- **Binning / discretisation**: equal-width, quantile-based, or domain-driven bins.
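A minimal pandas sketch of these numerical patterns; the column names and toy values are illustrative, not mandated by the skill:

```python
import numpy as np
import pandas as pd

# Toy OHLCV frame (illustrative data)
df = pd.DataFrame({
    "open":   [100.0, 101.0, 102.5, 101.5, 103.0],
    "high":   [101.5, 102.0, 103.0, 102.5, 104.0],
    "low":    [ 99.5, 100.5, 101.0, 100.5, 102.0],
    "close":  [101.0, 101.5, 102.0, 102.5, 103.5],
    "volume": [1000, 1200, 900, 1500, 1100],
})

# Interaction terms
df["dollar_volume"] = df["close"] * df["volume"]
df["hl_ratio"] = df["high"] / df["low"]
df["bar_return"] = df["close"] - df["open"]

# Log transform to tame heavy-tailed volume (log1p handles zero volume)
df["log_volume"] = np.log1p(df["volume"])

# Quantile-based binning into three buckets
df["volume_bin"] = pd.qcut(df["volume"], q=3, labels=False)
```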
### 3.2 Categorical Features
- **One-hot encoding**: for low-cardinality categoricals (sector, exchange).
- **Target encoding**: mean-target per category with smoothing (careful of leakage -- use only in-fold means).
- **Ordinal encoding**: when categories have a natural order (credit rating).
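Target encoding is the easiest of these to leak with; a sketch of the in-fold discipline using only pandas/NumPy (the helper name, smoothing constant, and toy data are this example's own, and for time-series you would use ordered rather than arbitrary folds):

```python
import numpy as np
import pandas as pd

def target_encode_oof(cat: pd.Series, y: pd.Series, n_folds: int = 3,
                      smoothing: float = 10.0) -> pd.Series:
    """Out-of-fold target encoding: each row is encoded with category means
    computed on the *other* folds only, so the row's own label never leaks in."""
    global_mean = y.mean()
    encoded = pd.Series(index=cat.index, dtype=float)
    folds = np.array_split(np.arange(len(cat)), n_folds)
    for test_idx in folds:
        train_mask = ~cat.index.isin(cat.index[test_idx])
        stats = y[train_mask].groupby(cat[train_mask]).agg(["mean", "count"])
        # Shrink rare categories toward the global mean
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (stats["count"] + smoothing)
        encoded.iloc[test_idx] = cat.iloc[test_idx].map(smooth).fillna(global_mean).values
    return encoded

cat = pd.Series(["tech", "energy", "tech", "tech", "energy", "tech"])
y = pd.Series([1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
enc = target_encode_oof(cat, y)
```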
### 3.3 Time-Series Specific
- **Lag features**: `return_{t-1}`, `return_{t-5}`, etc.
- **Calendar features**: day-of-week, month, quarter, options-expiry flag.
- **Rolling z-score**: `(x - rolling_mean) / rolling_std` for stationarity.
- **Fractional differentiation**: preserve memory while achieving stationarity (Lopez de Prado).
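The lag and rolling z-score patterns can be sketched in pandas; note the `.shift(1)` on every rolling quantity so the feature at bar `t` uses only bars up to `t-1` (the simulated price series is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 100)))  # toy price path
ret = prices.pct_change()

feats = pd.DataFrame({
    "ret_lag1": ret.shift(1),
    "ret_lag5": ret.shift(5),
})

# Rolling z-score: shift(1) so the window ends at t-1 and never sees bar t
roll_mean = ret.rolling(20).mean().shift(1)
roll_std = ret.rolling(20).std().shift(1)
feats["ret_zscore"] = (ret.shift(1) - roll_mean) / roll_std
```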
### 3.4 Feature Selection Techniques
- **Filter methods**: mutual information, variance threshold, correlation pruning.
- **Wrapper methods**: recursive feature elimination (RFE), forward/backward selection.
- **Embedded methods**: L1 regularisation, tree-based importance, SHAP values.
- **Permutation importance**: model-agnostic; run on out-of-fold predictions.
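Permutation importance is simple enough to sketch without a library: shuffle one column, measure the metric drop. The `predict` function and data below are stand-ins for a fitted model and out-of-fold predictions:

```python
import numpy as np

def permutation_importance(predict, X, y, metric, n_repeats=5, seed=0):
    """Model-agnostic importance: mean drop in the metric when one
    feature column is shuffled. Run on out-of-fold data in practice."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])           # destroy column j's information
            drops.append(baseline - metric(y, predict(Xp)))
        importances[j] = np.mean(drops)
    return importances

# Toy check: the target depends only on column 0
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=500)
predict = lambda X: 2.0 * X[:, 0]           # hypothetical fitted model
r2 = lambda y, p: 1 - np.sum((y - p) ** 2) / np.sum((y - y.mean()) ** 2)
imp = permutation_importance(predict, X, y, r2)
```

Only the informative column should show a large importance; the ignored columns score near zero.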
---
## 4. Anti-Leakage Checks
Data leakage is the single most common cause of inflated backtest results.
Apply these checks at every pipeline stage:
### 4.1 Label Leakage
- Labels must be computed from **future** returns relative to the feature
timestamp. Verify that the label window does not overlap the feature window.
- Use purging and embargo when labels span multiple bars.
### 4.2 Feature Leakage
- No feature may use information from time `t+1` or later at prediction time `t`.
- Rolling statistics must use only completed past bars; shift the window by one so bar `t` never sees itself: `df['feat'].rolling(20).mean().shift(1)`.
- Target-encoded categoricals must be computed on the **training fold only**.
### 4.3 Cross-Validation Leakage
- Use **purged k-fold** or **walk-forward** CV for time-series. Never use random
k-fold on ordered data.
- Insert an **embargo gap** between train and test folds to prevent bleed-through
from autocorrelation.
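A minimal sketch of a purged walk-forward splitter with an embargo gap (the function name, fold scheme, and defaults are this example's own, not a library API):

```python
import numpy as np

def purged_walk_forward(n_samples, n_folds=4, embargo=5):
    """Yield (train_idx, test_idx) pairs for ordered data. Training uses
    only bars strictly before the test block, minus an embargo gap that
    absorbs label overlap and autocorrelation bleed-through."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold_size
        test_end = min(test_start + fold_size, n_samples)
        train_end = max(test_start - embargo, 0)
        yield np.arange(train_end), np.arange(test_start, test_end)

splits = list(purged_walk_forward(100, n_folds=4, embargo=5))
```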
### 4.4 Survivorship & Selection Bias
- Ensure the universe of instruments at time `t` reflects what was actually
tradable at that time (delisted stocks, halted symbols removed later).
- Backfill from point-in-time databases where available.
### 4.5 Validation Checklist
Run before every backtest:
```text
[ ] Labels computed strictly from future returns (no overlap with features)
[ ] All rolling features shifted by at least 1 bar
[ ] Target encoding uses in-fold means only
[ ] Walk-forward or purged CV used (no random shuffle on time-series)
[ ] Embargo gap >= max(label_horizon, autocorrelation_lag)
[ ] Universe is point-in-time (no survivorship bias)
[ ] No global scaling fitted on full dataset (fit on train, transform test)
```
---
## 5. Pipeline Automation (AutoML)
### 5.1 Prerequisites
- Python environment with one or more AutoML libraries:
Auto-sklearn, TPOT, H2O AutoML, PyCaret, Optuna, or custom Optuna pipelines.
- Training data in CSV / Parquet / database.
- Problem type identified: classification, regression, or time-series forecasting.
### 5.2 Pipeline Steps
| Step | Action |
|------|--------|
| **1. Define requirements** | Problem type, evaluation metric, time/resource budget, interpretability needs. |
| **2. Data infrastructure** | Load data, quality assessment, train/val/test split strategy, define feature transforms. |
| **3. Configure AutoML** | Select framework, define algorithm search space, set preprocessing steps, choose tuning strategy (Bayesian, random, Hyperband). |
| **4. Execute training** | Run automated feature engineering, model selection, hyperparameter optimisation, cross-validation. |
| **5. Analyse & export** | Compare models, extract best config, feature importance, visualisations, export for deployment. |
### 5.3 Pipeline Configuration Template
```python
pipeline_config = {
    "task_type": "classification",   # or "regression", "time_series"
    "time_budget_seconds": 3600,
    "algorithms": ["rf", "xgboost", "catboost", "lightgbm"],
    "preprocessing": ["scaling", "encoding", "imputation"],
    "tuning_strategy": "bayesian",   # or "random", "hyperband"
    "cv_folds": 5,
    "cv_type": "purged_kfold",       # or "walk_forward"
    "embargo_bars": 10,
    "early_stopping_rounds": 50,
    "metric": "sharpe_ratio",        # domain-specific metric
}
```
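Whatever AutoML framework consumes this config, the search reduces to a budgeted loop like the following framework-agnostic sketch; `evaluate` is a stand-in for "train with purged CV and return the mean score", and its scores are fabricated for illustration:

```python
import random
import time

pipeline_config = {
    "task_type": "classification",
    "time_budget_seconds": 2,
    "algorithms": ["rf", "xgboost", "catboost", "lightgbm"],
    "tuning_strategy": "random",
    "cv_folds": 5,
}

def evaluate(algo, params):
    """Stand-in for 'fit with purged CV, return mean validation score'."""
    base = {"rf": 0.60, "xgboost": 0.65, "catboost": 0.64, "lightgbm": 0.63}[algo]
    return base + random.uniform(-0.05, 0.05) * params["learning_rate"]

def random_search(config, max_trials=200, seed=42):
    random.seed(seed)
    deadline = time.monotonic() + config["time_budget_seconds"]
    best = {"score": float("-inf")}
    trials = 0
    while time.monotonic() < deadline and trials < max_trials:
        algo = random.choice(config["algorithms"])
        params = {"learning_rate": random.uniform(0.01, 0.3)}
        score = evaluate(algo, params)
        if score > best["score"]:
            best = {"score": score, "algorithm": algo, "params": params}
        trials += 1
    return best

best = random_search(pipeline_config)
```

Bayesian or Hyperband strategies replace the random draw with an informed proposal, but keep the same budget-bounded loop shape.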
### 5.4 Output Artifacts
- `automl_config.py` -- pipeline configuration.
- `best_model.pkl` / `.joblib` / `.onnx` -- serialised model.
- `feature_pipeline.pkl` -- fitted preprocessing + feature transforms.
- `evaluation_report.json` -- metrics, confusion matrix / residuals, feature rankings.
- `deployment/` -- prediction API code, input validation, requirements.txt.
---
## 6. Core ML Fundamentals (Feature-Engineering Context)
### 6.1 Bias-Variance Trade-off
- More features increase model capacity (lower bias) but risk overfitting (higher variance).
- Use regularisation (L1/L2), feature selection, or dimensionality reduction to manage.
### 6.2 Evaluation Strategy
- **Walk-forward validation**: the gold standard for time-series strategies.
Roll a fixed-width training window forward; test on the next out-of-sample period.
- **Monte Carlo permutation tests**: shuffle labels and re-evaluate to estimate
the probability that observed performance is due to chance.
- **Combinatorial purged CV (CPCV)**: generate many train/test combinations with
purging for more robust performance estimates.
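The Monte Carlo permutation idea can be sketched in a few lines of NumPy; the signal/return setup and function name are illustrative:

```python
import numpy as np

def permutation_pvalue(returns, signal, n_perm=1000, seed=0):
    """Shuffle the signal and recompute strategy mean return to estimate
    the probability the observed performance arises by chance."""
    rng = np.random.default_rng(seed)
    observed = np.mean(signal * returns)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(signal)
        if np.mean(perm * returns) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
returns = rng.normal(0, 0.01, 500)
good_signal = np.sign(returns)   # oracle signal: should test as significant
p = permutation_pvalue(returns, good_signal)
```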
### 6.3 Feature Scaling
- Fit scalers (StandardScaler, MinMaxScaler, RobustScaler) on the **training set only**.
- Apply the same fitted scaler to validation and test sets.
- RobustScaler is often preferred for financial data due to heavy tails.
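The fit-on-train discipline looks like this with hand-rolled robust scaling (median/IQR, mirroring what scikit-learn's RobustScaler computes; the simulated heavy-tailed data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_t(df=3, size=(300, 4)) * 5   # heavy-tailed, return-like data
X_train, X_test = X[:200], X[200:]

# Fit statistics on the training set only...
median = np.median(X_train, axis=0)
iqr = np.subtract(*np.percentile(X_train, [75, 25], axis=0))

# ...then apply the SAME fitted statistics to the test set
X_train_s = (X_train - median) / iqr
X_test_s = (X_test - median) / iqr
```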
### 6.4 Handling Missing Data
- Forward-fill then backward-fill for price data (be aware of leakage on backfill).
- Indicator column for missingness can itself be informative.
- Tree-based models can handle NaN natively; linear models cannot.
---
## 7. Workflow
For any feature engineering task, follow this sequence:
1. **Restate** the task in measurable terms (metric, constraints, deadline).
2. **Enumerate** required artifacts: datasets, feature definitions, configs, scripts, reports.
3. **Propose** a default approach and 1-2 alternatives with trade-offs.
4. **Implement** feature pipeline with anti-leakage checks built in.
5. **Validate** with walk-forward CV, Monte Carlo, and the leakage checklist above.
6. **Deliver** repo-ready code, documentation, and a run command.
---
## 8. Deep Learning Optimization
### 8.1 Optimizer Selection
| Optimizer | Best For | Learning Rate |
|-----------|----------|---------------|
| Adam | Most cases, adaptive | 1e-3 to 1e-4 |
| AdamW | Transformers, weight decay | 1e-4 to 1e-5 |
| SGD + Momentum | Large batches, fine-tuning | 1e-2 to 1e-3 |
| RAdam | Stability without warmup | 1e-3 |
### 8.2 Learning Rate Scheduling
- **OneCycleLR**: Best for short training, fast convergence
- **CosineAnnealing**: Smooth decay, good generalization
- **ReduceOnPlateau**: Adaptive when validation loss plateaus
- **Warmup + Decay**: Standard for transformers
### 8.3 Regularization Techniques
- **Dropout**: 0.1-0.5 for fully connected layers
- **L2 (Weight Decay)**: 1e-4 to 1e-2
- **Batch Normalization**: Stabilizes training
- **Early Stopping**: Monitor validation loss, patience 5-10 epochs
### 8.4 PyTorch Lightning Integration
```python
import torch
import pytorch_lightning as pl

class TradingModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-4)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, total_steps=self.trainer.estimated_stepping_batches
        )
        return [optimizer], [scheduler]
```
### 8.5 Financial Reinforcement Learning
- **State**: Market features, portfolio state, position
- **Action**: Buy/Sell/Hold, position sizing
- **Reward**: Risk-adjusted returns (Sharpe, Sortino)
- **Frameworks**: Stable-Baselines3, RLlib, FinRL
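A risk-adjusted reward can be sketched without any RL framework; the rolling-Sharpe formulation, window size, and oracle policy below are this example's own choices, not prescribed by those libraries:

```python
import numpy as np

def sharpe_reward(step_returns, window=30, eps=1e-8):
    """Per-step reward: rolling Sharpe of the last `window` portfolio
    returns, so the agent is paid for risk-adjusted P&L, not raw return."""
    recent = np.asarray(step_returns[-window:])
    if len(recent) < 2:
        return 0.0
    return float(np.mean(recent) / (np.std(recent) + eps))

# Positions in {-1, 0, +1} applied to simulated market returns
rng = np.random.default_rng(5)
market = rng.normal(0.0005, 0.01, 200)
positions = np.sign(market)          # hypothetical oracle policy
pnl = list(positions * market)
reward = sharpe_reward(pnl)
```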
---
## 9. Error Handling
| Problem | Cause | Fix |
|---------|-------|-----|
| AutoML search finds no good model | Insufficient time budget or poor features | Increase budget, engineer better features, expand algorithm search space. |
| Out of memory during training | Dataset too large for available RAM | Downsample, use incremental learning, simplify feature engineering. |
| Model accuracy below threshold | Weak signal or overfitting | Collect more data, add domain-driven features, regularise, adjust metric. |
| Feature transforms produce NaN/Inf | Division by zero, log of negative | Add guards: `np.where(denom != 0, ...)`, `np.log1p(np.abs(x))`. |
| Optimiser fails to converge | Bad hyperparameter ranges | Tighten search bounds, increase iterations, exclude unstable algorithms. |
---
## 10. Bundled Scripts
All scripts live in `scripts/` within this skill directory.
| Script | Purpose |
|--------|---------|
| `data_validation.py` | Validate input data quality before pipeline execution. |
| `model_evaluation.py` | Evaluate trained model performance and generate reports. |
| `pipeline_deployment.py` | Deploy a trained pipeline to a target environment with rollback support. |
| `feature_engineering_pipeline.py` | End-to-end feature engineering: load, clean, transform, select, train. |
| `feature_importance_analyzer.py` | Analyse feature importance (permutation, SHAP, tree-based). |
| `data_visualizer.py` | Visualise feature distributions and relationships to target. |
| `feature_store_integration.py` | Integrate with feature stores (Feast, Tecton) for online/offline serving. |
---
## 11. Resources
### Frameworks
- **scikit-learn** -- preprocessing, feature selection, pipelines.
- **Auto-sklearn / TPOT / H2O AutoML / PyCaret** -- automated pipeline search.
- **Optuna** -- flexible hyperparameter optimisation.
- **SHAP** -- model-agnostic feature importance.
- **Feast / Tecton** -- feature store management.
- **PyTorch Lightning** -- https://lightning.ai/docs/pytorch/stable/
- **Stable-Baselines3** -- https://stable-baselines3.readthedocs.io/
- **FinRL** -- https://github.com/AI4Finance-Foundation/FinRL
### Key References
- Lopez de Prado, *Advances in Financial Machine Learning* (2018) -- purged CV, fractional differentiation, meta-labelling.
- Hastie, Tibshirani & Friedman, *The Elements of Statistical Learning* -- bias-variance, regularisation, model selection.
- scikit-learn user guide: feature extraction, preprocessing, model selection.
### Best Practices
- Always start with a simple baseline before running AutoML.
- Balance automation with domain knowledge -- blind search rarely beats informed priors.
- Monitor resource consumption; set hard timeouts.
- Validate on true out-of-sample holdout data, not just cross-validation.
- Document every pipeline decision for reproducibility.
---
## Skill Companion Files
> Additional files collected from the skill directory layout.
### _meta.json
```json
{
  "owner": "ahuserious",
  "slug": "ml-pipeline",
  "displayName": "ML Pipeline",
  "latest": {
    "version": "0.1.0",
    "publishedAt": 1772213873706,
    "commit": "https://github.com/openclaw/skills/commit/19dda76bbd45510fc4f793568cf08e8c72f43ca8"
  },
  "history": []
}
```
### assets/README.md
```markdown
# Assets -- ml-feature-engineering
Place static assets here:
- Pipeline configuration templates (YAML, JSON).
- Jupyter notebook templates for feature exploration.
- HTML report templates for model evaluation output.
- Diagram sources for pipeline architecture documentation.
```
### references/README.md
```markdown
# References -- ml-feature-engineering
Place reference materials here:
- Research papers on feature engineering for financial ML.
- Configuration templates and example pipeline configs.
- Links to external documentation (scikit-learn, SHAP, Optuna, etc.).
## Key References
- Lopez de Prado, *Advances in Financial Machine Learning* (2018).
- Hastie, Tibshirani & Friedman, *The Elements of Statistical Learning*.
- scikit-learn user guide: https://scikit-learn.org/stable/user_guide.html
- SHAP documentation: https://shap.readthedocs.io/
```
### scripts/README.md
````markdown
# Scripts -- ml-feature-engineering
Bundled utility scripts for the consolidated ML Feature Engineering skill.
## Pipeline Automation (from automl-pipeline-builder-2)
- **data_validation.py** -- Validate input data quality before pipeline execution.
- **model_evaluation.py** -- Evaluate trained model performance and generate reports.
- **pipeline_deployment.py** -- Deploy a trained pipeline with rollback support.
## Feature Engineering (from machine-learning-feature-engineering-toolkit)
- **feature_engineering_pipeline.py** -- End-to-end feature engineering process.
- **feature_importance_analyzer.py** -- Analyse feature importance (permutation, SHAP, tree-based).
- **data_visualizer.py** -- Visualise feature distributions and target relationships.
- **feature_store_integration.py** -- Integrate with feature stores (Feast, Tecton).
## Usage
All scripts accept `--help` for argument documentation:
```bash
python scripts/data_validation.py --help
python scripts/feature_importance_analyzer.py --help
```
````
### scripts/data_validation.py
```python
#!/usr/bin/env python3
"""
automl-pipeline-builder - data_validation.py
Script to validate input data for the AutoML pipeline, ensuring data quality and preventing errors.
Generated: 2025-12-10 03:48:17
"""
import os
import sys
import json
import argparse
from pathlib import Path
from datetime import datetime
def process_file(file_path: Path) -> bool:
    """Process an individual file."""
    if not file_path.exists():
        print(f"❌ File not found: {file_path}")
        return False
    print(f"📄 Processing: {file_path}")
    # Add processing logic here based on skill requirements.
    # This is a template that can be customized.
    try:
        if file_path.suffix == '.json':
            with open(file_path) as f:
                data = json.load(f)
            print(f"   ✓ Valid JSON with {len(data)} keys")
        else:
            size = file_path.stat().st_size
            print(f"   ✓ File size: {size:,} bytes")
        return True
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return False

def process_directory(dir_path: Path) -> tuple[int, int]:
    """Process all files in a directory; return (processed, failed) counts."""
    processed = 0
    failed = 0
    for file_path in dir_path.rglob('*'):
        if file_path.is_file():
            if process_file(file_path):
                processed += 1
            else:
                failed += 1
    return processed, failed

def main():
    parser = argparse.ArgumentParser(
        description="Script to validate input data for the AutoML pipeline, ensuring data quality and preventing errors."
    )
    parser.add_argument('input', help='Input file or directory')
    parser.add_argument('--output', '-o', help='Output directory')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    parser.add_argument('--config', '-c', help='Configuration file')
    args = parser.parse_args()
    input_path = Path(args.input)
    print("🚀 automl-pipeline-builder - data_validation.py")
    print("   Category: ai-ml")
    print("   Plugin: automl-pipeline-builder")
    print(f"   Input: {input_path}")
    if args.config and Path(args.config).exists():
        with open(args.config) as f:
            config = json.load(f)
        print(f"   Config: {args.config}")
    # Process input
    if input_path.is_file():
        success = process_file(input_path)
        result = 0 if success else 1
    elif input_path.is_dir():
        processed, failed = process_directory(input_path)
        print("\n📊 SUMMARY")
        print(f"   ✅ Processed: {processed}")
        print(f"   ❌ Failed: {failed}")
        result = 0 if failed == 0 else 1
    else:
        print(f"❌ Invalid input: {input_path}")
        result = 1
    if result == 0:
        print("\n✅ Completed successfully")
    else:
        print("\n❌ Completed with errors")
    return result

if __name__ == "__main__":
    sys.exit(main())
```
### scripts/data_visualizer.py
```python
#!/usr/bin/env python3
"""
feature-engineering-toolkit - data_visualizer.py
Generates visualizations of features and their relationships to the target variable, aiding in understanding data patterns.
Generated: 2025-12-10 03:48:17
"""
import os
import sys
import json
import argparse
from pathlib import Path
from datetime import datetime
def process_file(file_path: Path) -> bool:
    """Process an individual file."""
    if not file_path.exists():
        print(f"❌ File not found: {file_path}")
        return False
    print(f"📄 Processing: {file_path}")
    # Add processing logic here based on skill requirements.
    # This is a template that can be customized.
    try:
        if file_path.suffix == '.json':
            with open(file_path) as f:
                data = json.load(f)
            print(f"   ✓ Valid JSON with {len(data)} keys")
        else:
            size = file_path.stat().st_size
            print(f"   ✓ File size: {size:,} bytes")
        return True
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return False

def process_directory(dir_path: Path) -> tuple[int, int]:
    """Process all files in a directory; return (processed, failed) counts."""
    processed = 0
    failed = 0
    for file_path in dir_path.rglob('*'):
        if file_path.is_file():
            if process_file(file_path):
                processed += 1
            else:
                failed += 1
    return processed, failed

def main():
    parser = argparse.ArgumentParser(
        description="Generates visualizations of features and their relationships to the target variable, aiding in understanding data patterns."
    )
    parser.add_argument('input', help='Input file or directory')
    parser.add_argument('--output', '-o', help='Output directory')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    parser.add_argument('--config', '-c', help='Configuration file')
    args = parser.parse_args()
    input_path = Path(args.input)
    print("🚀 feature-engineering-toolkit - data_visualizer.py")
    print("   Category: ai-ml")
    print("   Plugin: feature-engineering-toolkit")
    print(f"   Input: {input_path}")
    if args.config and Path(args.config).exists():
        with open(args.config) as f:
            config = json.load(f)
        print(f"   Config: {args.config}")
    # Process input
    if input_path.is_file():
        success = process_file(input_path)
        result = 0 if success else 1
    elif input_path.is_dir():
        processed, failed = process_directory(input_path)
        print("\n📊 SUMMARY")
        print(f"   ✅ Processed: {processed}")
        print(f"   ❌ Failed: {failed}")
        result = 0 if failed == 0 else 1
    else:
        print(f"❌ Invalid input: {input_path}")
        result = 1
    if result == 0:
        print("\n✅ Completed successfully")
    else:
        print("\n❌ Completed with errors")
    return result

if __name__ == "__main__":
    sys.exit(main())
```
### scripts/feature_engineering_pipeline.py
```python
#!/usr/bin/env python3
"""
feature-engineering-toolkit - feature_engineering_pipeline.py
Automates the entire feature engineering process, including data loading, cleaning, transformation, selection, and model training.
Generated: 2025-12-10 03:48:17
"""
import os
import sys
import json
import argparse
from pathlib import Path
from datetime import datetime
def process_file(file_path: Path) -> bool:
    """Process an individual file."""
    if not file_path.exists():
        print(f"❌ File not found: {file_path}")
        return False
    print(f"📄 Processing: {file_path}")
    # Add processing logic here based on skill requirements.
    # This is a template that can be customized.
    try:
        if file_path.suffix == '.json':
            with open(file_path) as f:
                data = json.load(f)
            print(f"   ✓ Valid JSON with {len(data)} keys")
        else:
            size = file_path.stat().st_size
            print(f"   ✓ File size: {size:,} bytes")
        return True
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return False

def process_directory(dir_path: Path) -> tuple[int, int]:
    """Process all files in a directory; return (processed, failed) counts."""
    processed = 0
    failed = 0
    for file_path in dir_path.rglob('*'):
        if file_path.is_file():
            if process_file(file_path):
                processed += 1
            else:
                failed += 1
    return processed, failed

def main():
    parser = argparse.ArgumentParser(
        description="Automates the entire feature engineering process, including data loading, cleaning, transformation, selection, and model training."
    )
    parser.add_argument('input', help='Input file or directory')
    parser.add_argument('--output', '-o', help='Output directory')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    parser.add_argument('--config', '-c', help='Configuration file')
    args = parser.parse_args()
    input_path = Path(args.input)
    print("🚀 feature-engineering-toolkit - feature_engineering_pipeline.py")
    print("   Category: ai-ml")
    print("   Plugin: feature-engineering-toolkit")
    print(f"   Input: {input_path}")
    if args.config and Path(args.config).exists():
        with open(args.config) as f:
            config = json.load(f)
        print(f"   Config: {args.config}")
    # Process input
    if input_path.is_file():
        success = process_file(input_path)
        result = 0 if success else 1
    elif input_path.is_dir():
        processed, failed = process_directory(input_path)
        print("\n📊 SUMMARY")
        print(f"   ✅ Processed: {processed}")
        print(f"   ❌ Failed: {failed}")
        result = 0 if failed == 0 else 1
    else:
        print(f"❌ Invalid input: {input_path}")
        result = 1
    if result == 0:
        print("\n✅ Completed successfully")
    else:
        print("\n❌ Completed with errors")
    return result

if __name__ == "__main__":
    sys.exit(main())
```
### scripts/feature_importance_analyzer.py
```python
#!/usr/bin/env python3
"""
feature-engineering-toolkit - Analysis Script
Analyzes feature importance using various techniques (e.g., permutation importance, SHAP values) and provides insights into which features are most influential.
Generated: 2025-12-10 03:48:17
"""
import os
import json
import argparse
from pathlib import Path
from typing import Dict, List
from datetime import datetime
class Analyzer:
    def __init__(self, target_path: str):
        self.target_path = Path(target_path)
        self.stats = {
            'total_files': 0,
            'total_size': 0,
            'file_types': {},
            'issues': [],
            'recommendations': []
        }

    def analyze_directory(self) -> Dict:
        """Analyze directory structure and contents."""
        if not self.target_path.exists():
            self.stats['issues'].append(f"Path does not exist: {self.target_path}")
            return self.stats
        for file_path in self.target_path.rglob('*'):
            if file_path.is_file():
                self.analyze_file(file_path)
        return self.stats

    def analyze_file(self, file_path: Path):
        """Analyze an individual file."""
        self.stats['total_files'] += 1
        self.stats['total_size'] += file_path.stat().st_size
        # Track file types
        ext = file_path.suffix.lower()
        if ext:
            self.stats['file_types'][ext] = self.stats['file_types'].get(ext, 0) + 1
        # Check for potential issues
        if file_path.stat().st_size > 100 * 1024 * 1024:  # 100 MB
            self.stats['issues'].append(f"Large file: {file_path} ({file_path.stat().st_size // 1024 // 1024}MB)")
        if file_path.stat().st_size == 0:
            self.stats['issues'].append(f"Empty file: {file_path}")

    def generate_recommendations(self):
        """Generate recommendations based on analysis."""
        if self.stats['total_files'] == 0:
            self.stats['recommendations'].append("No files found - check target path")
        if len(self.stats['file_types']) > 20:
            self.stats['recommendations'].append("Many file types detected - consider organizing")
        if self.stats['total_size'] > 1024 * 1024 * 1024:  # 1 GB
            self.stats['recommendations'].append("Large total size - consider archiving old data")

    def generate_report(self) -> str:
        """Generate the analysis report."""
        report = []
        report.append("\n" + "=" * 60)
        report.append("ANALYSIS REPORT - feature-engineering-toolkit")
        report.append("=" * 60)
        report.append(f"Target: {self.target_path}")
        report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        report.append("")
        # Statistics
        report.append("📊 STATISTICS")
        report.append(f"   Total Files: {self.stats['total_files']:,}")
        report.append(f"   Total Size: {self.stats['total_size'] / 1024 / 1024:.2f} MB")
        report.append(f"   File Types: {len(self.stats['file_types'])}")
        # Top file types
        if self.stats['file_types']:
            report.append("\n📁 TOP FILE TYPES")
            sorted_types = sorted(self.stats['file_types'].items(), key=lambda x: x[1], reverse=True)[:5]
            for ext, count in sorted_types:
                report.append(f"   {ext or 'no extension'}: {count} files")
        # Issues
        if self.stats['issues']:
            report.append(f"\n⚠️ ISSUES ({len(self.stats['issues'])})")
            for issue in self.stats['issues'][:10]:
                report.append(f"   - {issue}")
            if len(self.stats['issues']) > 10:
                report.append(f"   ... and {len(self.stats['issues']) - 10} more")
        # Recommendations
        if self.stats['recommendations']:
            report.append("\n💡 RECOMMENDATIONS")
            for rec in self.stats['recommendations']:
                report.append(f"   - {rec}")
        report.append("")
        return "\n".join(report)

def main():
    parser = argparse.ArgumentParser(description="Analyzes feature importance using various techniques (e.g., permutation importance, SHAP values) and provides insights into which features are most influential.")
    parser.add_argument('target', help='Target directory to analyze')
    parser.add_argument('--output', '-o', help='Output report file')
    parser.add_argument('--json', action='store_true', help='Output as JSON')
    args = parser.parse_args()
    print(f"🔍 Analyzing {args.target}...")
    analyzer = Analyzer(args.target)
    stats = analyzer.analyze_directory()
    analyzer.generate_recommendations()
    if args.json:
        output = json.dumps(stats, indent=2)
    else:
        output = analyzer.generate_report()
    if args.output:
        Path(args.output).write_text(output)
        print(f"✅ Report saved to {args.output}")
    else:
        print(output)
    return 0 if len(stats['issues']) == 0 else 1

if __name__ == "__main__":
    import sys
    sys.exit(main())
```
### scripts/feature_store_integration.py
```python
#!/usr/bin/env python3
"""
feature-engineering-toolkit - feature_store_integration.py
Integrates with feature stores (e.g., Feast, Tecton) to manage and serve features for online and offline model deployment.
Generated: 2025-12-10 03:48:17
"""
import os
import sys
import json
import argparse
from pathlib import Path
from datetime import datetime
def process_file(file_path: Path) -> bool:
    """Process an individual file."""
    if not file_path.exists():
        print(f"❌ File not found: {file_path}")
        return False
    print(f"📄 Processing: {file_path}")
    # Add processing logic here based on skill requirements.
    # This is a template that can be customized.
    try:
        if file_path.suffix == '.json':
            with open(file_path) as f:
                data = json.load(f)
            print(f"   ✓ Valid JSON with {len(data)} keys")
        else:
            size = file_path.stat().st_size
            print(f"   ✓ File size: {size:,} bytes")
        return True
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return False

def process_directory(dir_path: Path) -> tuple[int, int]:
    """Process all files in a directory; return (processed, failed) counts."""
    processed = 0
    failed = 0
    for file_path in dir_path.rglob('*'):
        if file_path.is_file():
            if process_file(file_path):
                processed += 1
            else:
                failed += 1
    return processed, failed

def main():
    parser = argparse.ArgumentParser(
        description="Integrates with feature stores (e.g., Feast, Tecton) to manage and serve features for online and offline model deployment."
    )
    parser.add_argument('input', help='Input file or directory')
    parser.add_argument('--output', '-o', help='Output directory')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    parser.add_argument('--config', '-c', help='Configuration file')
    args = parser.parse_args()
    input_path = Path(args.input)
    print("🚀 feature-engineering-toolkit - feature_store_integration.py")
    print("   Category: ai-ml")
    print("   Plugin: feature-engineering-toolkit")
    print(f"   Input: {input_path}")
    if args.config and Path(args.config).exists():
        with open(args.config) as f:
            config = json.load(f)
        print(f"   Config: {args.config}")
    # Process input
    if input_path.is_file():
        success = process_file(input_path)
        result = 0 if success else 1
    elif input_path.is_dir():
        processed, failed = process_directory(input_path)
        print("\n📊 SUMMARY")
        print(f"   ✅ Processed: {processed}")
        print(f"   ❌ Failed: {failed}")
        result = 0 if failed == 0 else 1
    else:
        print(f"❌ Invalid input: {input_path}")
        result = 1
    if result == 0:
        print("\n✅ Completed successfully")
    else:
        print("\n❌ Completed with errors")
    return result

if __name__ == "__main__":
    sys.exit(main())
```
### scripts/model_evaluation.py
```python
#!/usr/bin/env python3
"""
automl-pipeline-builder - model_evaluation.py
Script to evaluate the performance of the trained AutoML model using various metrics and generate a report.
Generated: 2025-12-10 03:48:17
"""
import os
import sys
import json
import argparse
from pathlib import Path
from datetime import datetime
def process_file(file_path: Path) -> bool:
    """Process an individual file."""
    if not file_path.exists():
        print(f"❌ File not found: {file_path}")
        return False
    print(f"📄 Processing: {file_path}")
    # Add processing logic here based on skill requirements.
    # This is a template that can be customized.
    try:
        if file_path.suffix == '.json':
            with open(file_path) as f:
                data = json.load(f)
            print(f"   ✓ Valid JSON with {len(data)} keys")
        else:
            size = file_path.stat().st_size
            print(f"   ✓ File size: {size:,} bytes")
        return True
    except Exception as e:
        print(f"   ❌ Error: {e}")
        return False

def process_directory(dir_path: Path) -> tuple[int, int]:
    """Process all files in a directory; return (processed, failed) counts."""
    processed = 0
    failed = 0
    for file_path in dir_path.rglob('*'):
        if file_path.is_file():
            if process_file(file_path):
                processed += 1
            else:
                failed += 1
    return processed, failed

def main():
    parser = argparse.ArgumentParser(
        description="Script to evaluate the performance of the trained AutoML model using various metrics and generate a report."
    )
    parser.add_argument('input', help='Input file or directory')
    parser.add_argument('--output', '-o', help='Output directory')
    parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output')
    parser.add_argument('--config', '-c', help='Configuration file')
    args = parser.parse_args()
    input_path = Path(args.input)
    print("🚀 automl-pipeline-builder - model_evaluation.py")
    print("   Category: ai-ml")
    print("   Plugin: automl-pipeline-builder")
    print(f"   Input: {input_path}")
    if args.config and Path(args.config).exists():
        with open(args.config) as f:
            config = json.load(f)
        print(f"   Config: {args.config}")
    # Process input
    if input_path.is_file():
        success = process_file(input_path)
        result = 0 if success else 1
    elif input_path.is_dir():
        processed, failed = process_directory(input_path)
        print("\n📊 SUMMARY")
        print(f"   ✅ Processed: {processed}")
        print(f"   ❌ Failed: {failed}")
        result = 0 if failed == 0 else 1
    else:
        print(f"❌ Invalid input: {input_path}")
        result = 1
    if result == 0:
        print("\n✅ Completed successfully")
    else:
        print("\n❌ Completed with errors")
    return result

if __name__ == "__main__":
    sys.exit(main())
```
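Note that the template above only validates files; the metric computation its docstring promises still has to be supplied. As a dependency-free sketch of what that step might look like for a binary classifier (the `evaluate_binary` function is illustrative, not part of the skill; in practice scikit-learn's `accuracy_score`, `precision_score`, and friends would be used):

```python
import json


def evaluate_binary(y_true, y_pred):
    """Compute basic binary-classification metrics from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy predictions: 2 true positives, 1 false positive, 1 false negative, 2 true negatives
report = evaluate_binary([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(json.dumps(report, indent=2))
```

The resulting dict is exactly the shape that can be dumped into the JSON report the template is meant to generate.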
### scripts/pipeline_deployment.py
```python
#!/usr/bin/env python3
"""
automl-pipeline-builder - Deployment Script

Script to deploy the trained AutoML pipeline to a production environment.

Generated: 2025-12-10 03:48:17
"""
import sys
import json
import shutil
import argparse
from pathlib import Path
from datetime import datetime


class Deployer:
    def __init__(self, source: str, target: str):
        self.source = Path(source)
        self.target = Path(target)
        self.deployed = []
        self.failed = []

    def validate_source(self) -> bool:
        """Validate that the source directory exists and is non-empty."""
        if not self.source.exists():
            print(f"❌ Source directory not found: {self.source}")
            return False
        if not any(self.source.iterdir()):
            print(f"❌ Source directory is empty: {self.source}")
            return False
        print(f"✅ Source validated: {self.source}")
        return True

    def prepare_target(self) -> bool:
        """Prepare the target directory and write deployment metadata."""
        try:
            self.target.mkdir(parents=True, exist_ok=True)
            # Create deployment metadata
            metadata = {
                "deployment_time": datetime.now().isoformat(),
                "source": str(self.source),
                "skill": "automl-pipeline-builder",
                "category": "ai-ml",
                "plugin": "automl-pipeline-builder"
            }
            metadata_file = self.target / ".deployment.json"
            with open(metadata_file, 'w') as f:
                json.dump(metadata, f, indent=2)
            print(f"✅ Target prepared: {self.target}")
            return True
        except Exception as e:
            print(f"❌ Failed to prepare target: {e}")
            return False

    def deploy_files(self) -> bool:
        """Copy all files from source to target, preserving directory structure."""
        success = True
        for source_file in self.source.rglob('*'):
            if source_file.is_file():
                relative_path = source_file.relative_to(self.source)
                target_file = self.target / relative_path
                try:
                    target_file.parent.mkdir(parents=True, exist_ok=True)
                    shutil.copy2(source_file, target_file)
                    self.deployed.append(str(relative_path))
                    print(f"   ✅ Deployed: {relative_path}")
                except Exception as e:
                    self.failed.append({
                        "file": str(relative_path),
                        "error": str(e)
                    })
                    print(f"   ❌ Failed: {relative_path} - {e}")
                    success = False
        return success

    def generate_report(self) -> dict:
        """Generate and save a deployment report."""
        report = {
            "deployment_time": datetime.now().isoformat(),
            "skill": "automl-pipeline-builder",
            "source": str(self.source),
            "target": str(self.target),
            "total_files": len(self.deployed) + len(self.failed),
            "deployed": len(self.deployed),
            "failed": len(self.failed),
            "deployed_files": self.deployed,
            "failed_files": self.failed
        }
        # Save report
        report_file = self.target / "deployment_report.json"
        with open(report_file, 'w') as f:
            json.dump(report, f, indent=2)
        return report

    def rollback(self):
        """Remove deployed files (and now-empty directories) after a failure."""
        print("⚠️  Rolling back deployment...")
        for deployed_file in self.deployed:
            file_path = self.target / deployed_file
            if file_path.exists():
                file_path.unlink()
                print(f"   ✅ Removed: {deployed_file}")
        # Remove empty directories, deepest first
        for dir_path in sorted(self.target.rglob('*'), reverse=True):
            if dir_path.is_dir() and not any(dir_path.iterdir()):
                dir_path.rmdir()


def main():
    parser = argparse.ArgumentParser(description="Script to deploy the trained AutoML pipeline to a production environment.")
    parser.add_argument('source', help='Source directory')
    parser.add_argument('target', help='Target deployment directory')
    parser.add_argument('--dry-run', action='store_true', help='Simulate deployment')
    parser.add_argument('--force', action='store_true', help='Overwrite existing files')
    parser.add_argument('--rollback-on-error', action='store_true', help='Rollback on any error')
    args = parser.parse_args()

    deployer = Deployer(args.source, args.target)
    print(f"🚀 Deploying automl-pipeline-builder...")
    print(f"   Source: {args.source}")
    print(f"   Target: {args.target}")

    if args.dry_run:
        print("\n⚠️  DRY RUN MODE - No files will be deployed")
        return 0

    # Validate and prepare
    if not deployer.validate_source():
        return 1
    if not deployer.prepare_target():
        return 1

    # Deploy
    success = deployer.deploy_files()

    # Generate report
    report = deployer.generate_report()
    print(f"\n📊 DEPLOYMENT SUMMARY")
    print(f"   Total Files: {report['total_files']}")
    print(f"   ✅ Deployed: {report['deployed']}")
    print(f"   ❌ Failed: {report['failed']}")

    if not success and args.rollback_on_error:
        deployer.rollback()
        return 1

    if report['failed'] == 0:
        print(f"\n✅ Deployment completed successfully!")
        return 0
    print(f"\n⚠️  Deployment completed with errors")
    return 1


if __name__ == "__main__":
    sys.exit(main())
```
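The copy-and-report flow above can be exercised end-to-end against temporary directories. The snippet below re-implements the core of the `deploy_files` loop inline (so it runs standalone without importing the script file) and checks the round trip against a staged "trained pipeline" tree; the file names are made up for the example:

```python
import json
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    source, target = Path(src), Path(dst)

    # Stage a fake "trained pipeline" artifact tree
    (source / "model").mkdir()
    (source / "model" / "pipeline.pkl").write_bytes(b"\x00fake-model")
    (source / "config.json").write_text(json.dumps({"version": "2.0.0"}))

    # Core of Deployer.deploy_files: walk, mirror structure, copy with metadata
    deployed = []
    for source_file in source.rglob('*'):
        if source_file.is_file():
            relative = source_file.relative_to(source)
            (target / relative).parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source_file, target / relative)
            deployed.append(str(relative))

    # Both staged files should arrive under the same relative paths
    copied = sorted(p.relative_to(target).as_posix() for p in target.rglob('*') if p.is_file())
    print(copied)
```

Running the deployment against throwaway directories like this (or via `--dry-run`) is a cheap safety check before pointing the script at a real production target.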