Back to skills
SkillHub ClubAnalyze Data & AIFull StackData / AI

single-cell-rna-qc

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars
19
Hot score
87
Updated
March 20, 2026
Overall rating
C2.5
Composite score
2.5
Best-practice grade
C68.3

Install command

npx @skill-hub/cli install biocontext-ai-skill-to-mcp-single-cell-rna-qc

Repository

biocontext-ai/skill-to-mcp

Skill path: skills/single-cell-rna-qc

Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.

Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: biocontext-ai.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

  • Install single-cell-rna-qc into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/biocontext-ai/skill-to-mcp before adding single-cell-rna-qc to shared team environments
  • Use single-cell-rna-qc for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: single-cell-rna-qc
description: Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
---

# Single-Cell RNA-seq Quality Control

Automated QC workflow for single-cell RNA-seq data following scverse best practices.

> **Acknowledgment**: This skill is adapted from [Anthropic's Life Sciences repository](https://github.com/anthropics/life-sciences) and follows best practices from the [scverse® community](https://scverse.org).

## When to Use This Skill

Use when users:
- Request quality control or QC on single-cell RNA-seq data
- Want to filter low-quality cells or assess data quality
- Need QC visualizations or metrics
- Ask to follow scverse/scanpy best practices
- Request MAD-based filtering or outlier detection

**Supported input formats:**
- `.h5ad` files (AnnData format from scanpy/Python workflows)
- `.h5` files (10X Genomics Cell Ranger output)

**Default recommendation**: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.

## Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)

For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:

```bash
python3 scripts/qc_analysis.py input.h5ad
# or for 10X Genomics .h5 files:
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```

The script automatically detects the file format and loads it appropriately.

**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- User wants the "just works" solution

**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy

**Parameters:**

Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Gene filtering threshold
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species

Use `--help` to see current default values.

**Outputs:**

All files are saved to `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved

If copying outputs to `/mnt/user-data/outputs/` for user access, copy individual files (not the entire directory) so users can preview them directly as Claude.ai artifacts.

### Workflow Steps

The script performs the following steps:

1. **Calculate QC metrics** - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
2. **Apply MAD-based filtering** - Permissive outlier detection using MAD thresholds for counts/genes/MT%
3. **Filter genes** - Remove genes detected in few cells
4. **Generate visualizations** - Comprehensive before/after plots with threshold overlays

## Approach 2: Modular Building Blocks (For Custom Workflows)

For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:

```python
# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions  # Only if visualization needed

adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here
```

**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support

**Available utility functions:**

From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics

From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots

**Example custom workflows:**

**Example 1: Only calculate metrics and visualize, don't filter yet**
```python
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
```

**Example 2: Apply only MT% filtering, keep other metrics permissive**
```python
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Only filter high MT% cells
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
```

**Example 3: Different thresholds for different subsets**
```python
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)

# Apply type-specific QC (assumes cell_type metadata exists)
neurons = adata.obs['cell_type'] == 'neuron'
other_cells = ~neurons

# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')
```

## Best Practices

1. **Be permissive with filtering** - Default thresholds intentionally retain most cells to avoid losing rare populations
2. **Inspect visualizations** - Always review before/after plots to ensure filtering makes biological sense
3. **Consider dataset-specific factors** - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
4. **Check gene annotations** - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
5. **Iterate if needed** - QC parameters may need adjustment based on the specific experiment or tissue type

## Reference Materials

For detailed QC methodology, parameter rationale, and troubleshooting guidance, see `references/scverse_qc_guidelines.md`. This reference provides:
- Detailed explanations of each QC metric and why it matters
- Rationale for MAD-based thresholds and why they're better than fixed cutoffs
- Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
- Species-specific considerations for gene annotations
- When and how to adjust filtering parameters
- Advanced QC considerations (ambient RNA correction, doublet detection)

Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.

## Next Steps After QC

Typical downstream analysis steps:
- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### scripts/qc_analysis.py

```python
#!/usr/bin/env python3
"""
Quality Control Analysis for Single-Cell RNA-seq Data
Following scverse best practices from:
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html

This is a convenience script that runs a complete QC workflow using the
modular functions from qc_core.py and qc_plotting.py.
"""

import anndata as ad
import scanpy as sc
import sys
import os
import argparse

# Import our modular utilities
from qc_core import (
    calculate_qc_metrics,
    detect_outliers_mad,
    apply_hard_threshold,
    filter_cells,
    filter_genes,
    print_qc_summary
)
from qc_plotting import (
    plot_qc_distributions,
    plot_filtering_thresholds,
    plot_qc_after_filtering
)

print("=" * 80)
print("Single-Cell RNA-seq Quality Control Analysis")
print("=" * 80)

# Default parameters (single source of truth)
DEFAULT_MAD_COUNTS = 5
DEFAULT_MAD_GENES = 5
DEFAULT_MAD_MT = 3
DEFAULT_MT_THRESHOLD = 8
DEFAULT_MIN_CELLS = 20
DEFAULT_MT_PATTERN = 'mt-,MT-'
DEFAULT_RIBO_PATTERN = 'Rpl,Rps,RPL,RPS'
DEFAULT_HB_PATTERN = '^Hb[^(p)]|^HB[^(P)]'

# Parse command-line arguments
parser = argparse.ArgumentParser(
    description='Quality Control Analysis for Single-Cell RNA-seq Data',
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog="""
Examples:
  python3 qc_analysis.py data.h5ad
  python3 qc_analysis.py raw_feature_bc_matrix.h5
  python3 qc_analysis.py data.h5ad --mad-counts 4 --mad-genes 4 --mad-mt 2.5
  python3 qc_analysis.py data.h5ad --mt-threshold 10 --min-cells 10
  python3 qc_analysis.py data.h5ad --mt-pattern "^mt-" --ribo-pattern "^Rpl,^Rps"
    """
)

parser.add_argument('input_file', help='Input .h5ad or .h5 file (10X Genomics format)')
parser.add_argument('--output-dir', type=str, help='Output directory (default: <input_basename>_qc_results)')
parser.add_argument('--mad-counts', type=float, default=DEFAULT_MAD_COUNTS, help=f'MAD threshold for total counts (default: {DEFAULT_MAD_COUNTS})')
parser.add_argument('--mad-genes', type=float, default=DEFAULT_MAD_GENES, help=f'MAD threshold for gene counts (default: {DEFAULT_MAD_GENES})')
parser.add_argument('--mad-mt', type=float, default=DEFAULT_MAD_MT, help=f'MAD threshold for mitochondrial percentage (default: {DEFAULT_MAD_MT})')
parser.add_argument('--mt-threshold', type=float, default=DEFAULT_MT_THRESHOLD, help=f'Hard threshold for mitochondrial percentage (default: {DEFAULT_MT_THRESHOLD})')
parser.add_argument('--min-cells', type=int, default=DEFAULT_MIN_CELLS, help=f'Minimum cells for gene filtering (default: {DEFAULT_MIN_CELLS})')
parser.add_argument('--mt-pattern', type=str, default=DEFAULT_MT_PATTERN, help=f'Comma-separated mitochondrial gene prefixes (default: "{DEFAULT_MT_PATTERN}")')
parser.add_argument('--ribo-pattern', type=str, default=DEFAULT_RIBO_PATTERN, help=f'Comma-separated ribosomal gene prefixes (default: "{DEFAULT_RIBO_PATTERN}")')
parser.add_argument('--hb-pattern', type=str, default=DEFAULT_HB_PATTERN, help=f'Hemoglobin gene regex pattern (default: "{DEFAULT_HB_PATTERN}")')

args = parser.parse_args()

# Verify input file exists
if not os.path.exists(args.input_file):
    print(f"\nError: File '{args.input_file}' not found!")
    sys.exit(1)

input_file = args.input_file
base_name = os.path.splitext(os.path.basename(input_file))[0]

# Set up output directory
if args.output_dir:
    output_dir = args.output_dir
else:
    output_dir = f"{base_name}_qc_results"

os.makedirs(output_dir, exist_ok=True)
print(f"\nOutput directory: {output_dir}")

# Display parameters
print(f"\nParameters:")
print(f"  MAD thresholds: counts={args.mad_counts}, genes={args.mad_genes}, MT%={args.mad_mt}")
print(f"  MT hard threshold: {args.mt_threshold}%")
print(f"  Min cells for gene filtering: {args.min_cells}")
print(f"  Gene patterns: MT={args.mt_pattern}, Ribo={args.ribo_pattern}")

# Load the data
print("\n[1/5] Loading data...")
file_ext = os.path.splitext(input_file)[1].lower()

if file_ext == '.h5ad':
    adata = ad.read_h5ad(input_file)
    print(f"Loaded .h5ad file: {adata.n_obs} cells × {adata.n_vars} genes")
elif file_ext == '.h5':
    adata = sc.read_10x_h5(input_file)
    print(f"Loaded 10X .h5 file: {adata.n_obs} cells × {adata.n_vars} genes")
    # Make variable names unique (10X data sometimes has duplicate gene names)
    adata.var_names_make_unique()
else:
    print(f"\nError: Unsupported file format '{file_ext}'. Expected .h5ad or .h5")
    sys.exit(1)

# Store original counts for comparison
n_cells_original = adata.n_obs
n_genes_original = adata.n_vars

# Calculate QC metrics
print("\n[2/5] Calculating QC metrics...")
calculate_qc_metrics(adata, mt_pattern=args.mt_pattern,
                     ribo_pattern=args.ribo_pattern,
                     hb_pattern=args.hb_pattern,
                     inplace=True)

print(f"  Found {adata.var['mt'].sum()} mitochondrial genes (pattern: {args.mt_pattern})")
print(f"  Found {adata.var['ribo'].sum()} ribosomal genes (pattern: {args.ribo_pattern})")
print(f"  Found {adata.var['hb'].sum()} hemoglobin genes (pattern: {args.hb_pattern})")

print_qc_summary(adata, label='QC Metrics Summary (before filtering)')

# Create before-filtering visualizations
print("\n[3/5] Creating QC visualizations...")
before_plot = os.path.join(output_dir, 'qc_metrics_before_filtering.png')
plot_qc_distributions(adata, before_plot, title='Quality Control Metrics - Before Filtering')
print(f"  Saved: {before_plot}")

# Apply MAD-based filtering
print("\n[4/5] Applying MAD-based filtering thresholds...")

# Detect outliers for each metric
adata.obs['outlier_counts'] = detect_outliers_mad(adata, 'total_counts', args.mad_counts)
adata.obs['outlier_genes'] = detect_outliers_mad(adata, 'n_genes_by_counts', args.mad_genes)
adata.obs['outlier_mt'] = detect_outliers_mad(adata, 'pct_counts_mt', args.mad_mt)

# Apply hard threshold for mitochondrial content
print(f"\n  Applying hard threshold for mitochondrial content (>{args.mt_threshold}%):")
high_mt_mask = apply_hard_threshold(adata, 'pct_counts_mt', args.mt_threshold, operator='>')

# Combine MT filters (MAD + hard threshold)
adata.obs['outlier_mt'] = adata.obs['outlier_mt'] | high_mt_mask

# Overall filtering decision
adata.obs['pass_qc'] = ~(
    adata.obs['outlier_counts'] |
    adata.obs['outlier_genes'] |
    adata.obs['outlier_mt']
)

print(f"\n  Total cells failing QC: {(~adata.obs['pass_qc']).sum()} ({(~adata.obs['pass_qc']).sum()/adata.n_obs*100:.2f}%)")
print(f"  Cells passing QC: {adata.obs['pass_qc'].sum()} ({adata.obs['pass_qc'].sum()/adata.n_obs*100:.2f}%)")

# Visualize filtering thresholds
outlier_masks = {
    'total_counts': adata.obs['outlier_counts'].values,
    'n_genes_by_counts': adata.obs['outlier_genes'].values,
    'pct_counts_mt': adata.obs['outlier_mt'].values
}

thresholds = {
    'total_counts': {'n_mads': args.mad_counts},
    'n_genes_by_counts': {'n_mads': args.mad_genes},
    'pct_counts_mt': {'n_mads': args.mad_mt, 'hard': args.mt_threshold}
}

threshold_plot = os.path.join(output_dir, 'qc_filtering_thresholds.png')
plot_filtering_thresholds(adata, outlier_masks, thresholds, threshold_plot)
print(f"\n  Saved: {threshold_plot}")

# Apply filtering
print("\n[5/5] Applying filters...")
adata_filtered = filter_cells(adata, adata.obs['pass_qc'].values, inplace=False)
print(f"  Cells after filtering: {adata_filtered.n_obs} (removed {n_cells_original - adata_filtered.n_obs})")

# Filter genes
print(f"\n  Filtering genes detected in <{args.min_cells} cells...")
filter_genes(adata_filtered, min_cells=args.min_cells, inplace=True)
print(f"  Genes after filtering: {adata_filtered.n_vars} (removed {n_genes_original - adata_filtered.n_vars})")

# Generate summary statistics
print("\n" + "=" * 80)
print("QC Summary")
print("=" * 80)

print("\nBefore filtering:")
print(f"  Cells: {n_cells_original}")
print(f"  Genes: {n_genes_original}")

print("\nAfter filtering:")
print(f"  Cells: {adata_filtered.n_obs} ({adata_filtered.n_obs/n_cells_original*100:.1f}% retained)")
print(f"  Genes: {adata_filtered.n_vars} ({adata_filtered.n_vars/n_genes_original*100:.1f}% retained)")

print_qc_summary(adata_filtered, label='\nFiltered data QC metrics')

# Create after-filtering visualizations
after_plot = os.path.join(output_dir, 'qc_metrics_after_filtering.png')
plot_qc_after_filtering(adata_filtered, after_plot)
print(f"\n  Saved: {after_plot}")

# Save filtered data
print("\nSaving filtered data...")
output_filtered = os.path.join(output_dir, f'{base_name}_filtered.h5ad')
output_with_qc = os.path.join(output_dir, f'{base_name}_with_qc.h5ad')
adata_filtered.write(output_filtered)
print(f"  Saved: {output_filtered}")

# Also save the unfiltered data with QC annotations
adata.write(output_with_qc)
print(f"  Saved: {output_with_qc} (original data with QC annotations)")

print("\n" + "=" * 80)
print("Quality Control Analysis Complete!")
print("=" * 80)
print(f"\nAll results saved to: {output_dir}/")
print("\nGenerated files:")
print("  1. qc_metrics_before_filtering.png - Initial QC visualizations")
print("  2. qc_filtering_thresholds.png - MAD-based threshold visualization")
print("  3. qc_metrics_after_filtering.png - Post-filtering QC visualizations")
print(f"  4. {base_name}_filtered.h5ad - Filtered dataset")
print(f"  5. {base_name}_with_qc.h5ad - Original dataset with QC annotations")
print("\nNext steps:")
print("  - Consider ambient RNA correction (SoupX)")
print("  - Consider doublet detection (scDblFinder)")
print("  - Proceed with normalization and downstream analysis")

```

### scripts/qc_core.py

```python
#!/usr/bin/env python3
"""
Core utility functions for single-cell RNA-seq quality control.

This module provides building blocks for metrics calculation and filtering
while following scverse best practices from:
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
"""

import anndata as ad
import scanpy as sc
import numpy as np
from scipy.stats import median_abs_deviation


def calculate_qc_metrics(adata, mt_pattern='mt-,MT-', ribo_pattern='Rpl,Rps,RPL,RPS',
                        hb_pattern='^Hb[^(p)]|^HB[^(P)]', inplace=True):
    """
    Calculate QC metrics for single-cell RNA-seq data.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix
    mt_pattern : str
        Comma-separated mitochondrial gene prefixes (default: 'mt-,MT-')
    ribo_pattern : str
        Comma-separated ribosomal gene prefixes (default: 'Rpl,Rps,RPL,RPS')
    hb_pattern : str
        Regex pattern for hemoglobin genes (default: '^Hb[^(p)]|^HB[^(P)]')
    inplace : bool
        Modify adata in place (default: True)

    Returns
    -------
    AnnData or None
        If inplace=False, returns modified AnnData. Otherwise modifies in place.
    """
    if not inplace:
        adata = adata.copy()

    # Identify gene categories
    mt_prefixes = tuple(mt_pattern.split(','))
    adata.var['mt'] = adata.var_names.str.startswith(mt_prefixes)

    ribo_prefixes = tuple(ribo_pattern.split(','))
    adata.var['ribo'] = adata.var_names.str.startswith(ribo_prefixes)

    adata.var['hb'] = adata.var_names.str.match(hb_pattern)

    # Calculate QC metrics
    sc.pp.calculate_qc_metrics(
        adata,
        qc_vars=['mt', 'ribo', 'hb'],
        percent_top=None,
        log1p=False,
        inplace=True
    )

    if not inplace:
        return adata


def detect_outliers_mad(adata, metric, n_mads, verbose=True):
    """
    Detect outliers using Median Absolute Deviation (MAD).

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix with QC metrics
    metric : str
        Column name in adata.obs to use for outlier detection
    n_mads : float
        Number of MADs to use as threshold
    verbose : bool
        Print outlier statistics (default: True)

    Returns
    -------
    np.ndarray
        Boolean mask where True indicates outliers
    """
    metric_values = adata.obs[metric]
    median = np.median(metric_values)
    mad = median_abs_deviation(metric_values)

    # Calculate bounds
    lower = median - n_mads * mad
    upper = median + n_mads * mad

    # Identify outliers
    outlier_mask = (metric_values < lower) | (metric_values > upper)

    if verbose:
        print(f"  {metric}:")
        print(f"    Median: {median:.2f}, MAD: {mad:.2f}")
        print(f"    Bounds: [{lower:.2f}, {upper:.2f}] ({n_mads} MADs)")
        print(f"    Outliers: {outlier_mask.sum()} cells ({outlier_mask.sum()/len(metric_values)*100:.2f}%)")

    return outlier_mask


def apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True):
    """
    Apply a hard threshold filter.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix
    metric : str
        Column name in adata.obs to filter on
    threshold : float
        Threshold value
    operator : str
        Comparison operator: '>', '<', '>=', '<=' (default: '>')
    verbose : bool
        Print filtering statistics (default: True)

    Returns
    -------
    np.ndarray
        Boolean mask where True indicates cells to filter out
    """
    metric_values = adata.obs[metric]

    if operator == '>':
        mask = metric_values > threshold
    elif operator == '<':
        mask = metric_values < threshold
    elif operator == '>=':
        mask = metric_values >= threshold
    elif operator == '<=':
        mask = metric_values <= threshold
    else:
        raise ValueError(f"Invalid operator: {operator}. Use '>', '<', '>=', or '<='")

    if verbose:
        print(f"  {metric} {operator} {threshold}:")
        print(f"    Cells filtered: {mask.sum()} ({mask.sum()/len(metric_values)*100:.2f}%)")

    return mask


def filter_cells(adata, mask, inplace=False):
    """
    Filter cells based on a boolean mask.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix
    mask : np.ndarray or pd.Series
        Boolean mask where True indicates cells to KEEP
    inplace : bool
        Modify adata in place (default: False)

    Returns
    -------
    AnnData
        Filtered AnnData object
    """
    if inplace:
        # This is actually a bit tricky - AnnData doesn't support true inplace filtering
        # Return filtered copy which caller should reassign
        return adata[mask].copy()
    else:
        return adata[mask].copy()


def filter_genes(adata, min_cells=20, min_counts=None, inplace=True):
    """
    Filter genes based on detection thresholds.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix
    min_cells : int
        Minimum number of cells a gene must be detected in (default: 20)
    min_counts : int, optional
        Minimum total counts across all cells
    inplace : bool
        Modify adata in place (default: True)

    Returns
    -------
    AnnData or None
        If inplace=False, returns filtered AnnData
    """
    if not inplace:
        adata = adata.copy()

    if min_cells is not None:
        sc.pp.filter_genes(adata, min_cells=min_cells)

    if min_counts is not None:
        sc.pp.filter_genes(adata, min_counts=min_counts)

    if not inplace:
        return adata


def print_qc_summary(adata, label=''):
    """
    Print summary statistics for QC metrics.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix with QC metrics
    label : str
        Label to prepend to output (e.g., 'Before filtering', 'After filtering')
    """
    if label:
        print(f"\n{label}:")
    print(f"  Cells: {adata.n_obs}")
    print(f"  Genes: {adata.n_vars}")

    if 'total_counts' in adata.obs:
        print(f"  Mean counts per cell: {adata.obs['total_counts'].mean():.0f}")
        print(f"  Median counts per cell: {adata.obs['total_counts'].median():.0f}")

    if 'n_genes_by_counts' in adata.obs:
        print(f"  Mean genes per cell: {adata.obs['n_genes_by_counts'].mean():.0f}")
        print(f"  Median genes per cell: {adata.obs['n_genes_by_counts'].median():.0f}")

    if 'pct_counts_mt' in adata.obs:
        print(f"  Mean mitochondrial %: {adata.obs['pct_counts_mt'].mean():.2f}%")

    if 'pct_counts_ribo' in adata.obs:
        print(f"  Mean ribosomal %: {adata.obs['pct_counts_ribo'].mean():.2f}%")

```

### scripts/qc_plotting.py

```python
#!/usr/bin/env python3
"""
Visualization functions for single-cell RNA-seq quality control.

This module provides plotting utilities for QC metrics and filtering thresholds.
"""

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import median_abs_deviation


def plot_qc_distributions(adata, output_path, title='Quality Control Metrics'):
    """
    Create comprehensive QC distribution plots.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix with QC metrics
    output_path : str
        Path to save the figure
    title : str
        Figure title (default: 'Quality Control Metrics')
    """
    fig, axes = plt.subplots(3, 3, figsize=(15, 12))
    fig.suptitle(title, fontsize=16, y=0.995)

    # Row 1: Histograms
    axes[0, 0].hist(adata.obs['total_counts'], bins=100, color='steelblue', edgecolor='black')
    axes[0, 0].set_xlabel('Total counts per cell')
    axes[0, 0].set_ylabel('Number of cells')
    axes[0, 0].set_title('Distribution of Total Counts')
    axes[0, 0].axvline(adata.obs['total_counts'].median(), color='red', linestyle='--', label='Median')
    axes[0, 0].legend()

    axes[0, 1].hist(adata.obs['n_genes_by_counts'], bins=100, color='forestgreen', edgecolor='black')
    axes[0, 1].set_xlabel('Genes per cell')
    axes[0, 1].set_ylabel('Number of cells')
    axes[0, 1].set_title('Distribution of Detected Genes')
    axes[0, 1].axvline(adata.obs['n_genes_by_counts'].median(), color='red', linestyle='--', label='Median')
    axes[0, 1].legend()

    axes[0, 2].hist(adata.obs['pct_counts_mt'], bins=100, color='coral', edgecolor='black')
    axes[0, 2].set_xlabel('Mitochondrial %')
    axes[0, 2].set_ylabel('Number of cells')
    axes[0, 2].set_title('Distribution of Mitochondrial Content')
    axes[0, 2].axvline(adata.obs['pct_counts_mt'].median(), color='red', linestyle='--', label='Median')
    axes[0, 2].legend()

    # Row 2: Violin plots
    axes[1, 0].violinplot([adata.obs['total_counts']], positions=[0], showmeans=True, showmedians=True)
    axes[1, 0].set_ylabel('Total counts')
    axes[1, 0].set_title('Total Counts per Cell')
    axes[1, 0].set_xticks([])

    axes[1, 1].violinplot([adata.obs['n_genes_by_counts']], positions=[0], showmeans=True, showmedians=True)
    axes[1, 1].set_ylabel('Genes detected')
    axes[1, 1].set_title('Genes per Cell')
    axes[1, 1].set_xticks([])

    axes[1, 2].violinplot([adata.obs['pct_counts_mt']], positions=[0], showmeans=True, showmedians=True)
    axes[1, 2].set_ylabel('Mitochondrial %')
    axes[1, 2].set_title('Mitochondrial Content')
    axes[1, 2].set_xticks([])

    # Row 3: Scatter plots
    scatter1 = axes[2, 0].scatter(
        adata.obs['total_counts'],
        adata.obs['n_genes_by_counts'],
        c=adata.obs['pct_counts_mt'],
        cmap='viridis',
        alpha=0.5,
        s=10
    )
    axes[2, 0].set_xlabel('Total counts')
    axes[2, 0].set_ylabel('Genes detected')
    axes[2, 0].set_title('Counts vs Genes (colored by MT%)')
    plt.colorbar(scatter1, ax=axes[2, 0], label='MT %')

    axes[2, 1].scatter(
        adata.obs['total_counts'],
        adata.obs['pct_counts_mt'],
        alpha=0.5,
        s=10,
        color='coral'
    )
    axes[2, 1].set_xlabel('Total counts')
    axes[2, 1].set_ylabel('Mitochondrial %')
    axes[2, 1].set_title('Total Counts vs Mitochondrial %')

    axes[2, 2].scatter(
        adata.obs['n_genes_by_counts'],
        adata.obs['pct_counts_mt'],
        alpha=0.5,
        s=10,
        color='forestgreen'
    )
    axes[2, 2].set_xlabel('Genes detected')
    axes[2, 2].set_ylabel('Mitochondrial %')
    axes[2, 2].set_title('Genes vs Mitochondrial %')

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()


def plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path):
    """
    Visualize filtering thresholds overlaid on distributions.

    Parameters
    ----------
    adata : AnnData
        Annotated data matrix with QC metrics
    outlier_masks : dict
        Dictionary mapping metric names to boolean outlier masks
        Example: {'total_counts': mask1, 'n_genes_by_counts': mask2, 'pct_counts_mt': mask3}
    thresholds : dict
        Dictionary with threshold information for each metric
        Example: {'total_counts': {'n_mads': 5}, 'pct_counts_mt': {'n_mads': 3, 'hard': 8}}
    output_path : str
        Path to save the figure
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    fig.suptitle('MAD-Based Filtering Thresholds', fontsize=16)

    # Helper function to plot with thresholds
    def plot_with_threshold(ax, metric, outlier_mask, n_mads, hard_threshold=None):
        data = adata.obs[metric]
        median = np.median(data)
        mad = median_abs_deviation(data)
        lower = median - n_mads * mad
        upper = median + n_mads * mad

        ax.hist(data[~outlier_mask], bins=100, alpha=0.7, label='Pass QC', color='steelblue')
        ax.hist(data[outlier_mask], bins=100, alpha=0.7, label='Fail QC', color='coral')
        ax.axvline(lower, color='red', linestyle='--', linewidth=2, label=f'Thresholds ({n_mads} MADs)')
        ax.axvline(upper, color='red', linestyle='--', linewidth=2)

        if hard_threshold is not None:
            ax.axvline(hard_threshold, color='darkred', linestyle=':', linewidth=2,
                      label=f'Hard threshold ({hard_threshold})')

        ax.set_xlabel(metric.replace('_', ' ').title())
        ax.set_ylabel('Number of cells')
        ax.legend()

    # Plot each metric
    metrics = [
        ('total_counts', 'Total Counts'),
        ('n_genes_by_counts', 'Genes Detected'),
        ('pct_counts_mt', 'Mitochondrial %')
    ]

    for idx, (metric, label) in enumerate(metrics):
        if metric in outlier_masks and metric in thresholds:
            hard = thresholds[metric].get('hard', None)
            plot_with_threshold(axes[idx], metric, outlier_masks[metric],
                              thresholds[metric]['n_mads'], hard)

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()


def plot_qc_after_filtering(adata, output_path):
    """
    Create QC plots for filtered data (simplified version without outlier overlay).

    Parameters
    ----------
    adata : AnnData
        Filtered annotated data matrix with QC metrics
    output_path : str
        Path to save the figure
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    fig.suptitle('Quality Control Metrics - After Filtering', fontsize=16, y=0.995)

    # Row 1: Histograms
    axes[0, 0].hist(adata.obs['total_counts'], bins=100, color='steelblue', edgecolor='black')
    axes[0, 0].set_xlabel('Total counts per cell')
    axes[0, 0].set_ylabel('Number of cells')
    axes[0, 0].set_title('Distribution of Total Counts')

    axes[0, 1].hist(adata.obs['n_genes_by_counts'], bins=100, color='forestgreen', edgecolor='black')
    axes[0, 1].set_xlabel('Genes per cell')
    axes[0, 1].set_ylabel('Number of cells')
    axes[0, 1].set_title('Distribution of Detected Genes')

    axes[0, 2].hist(adata.obs['pct_counts_mt'], bins=100, color='coral', edgecolor='black')
    axes[0, 2].set_xlabel('Mitochondrial %')
    axes[0, 2].set_ylabel('Number of cells')
    axes[0, 2].set_title('Distribution of Mitochondrial Content')

    # Row 2: Scatter plots
    scatter1 = axes[1, 0].scatter(
        adata.obs['total_counts'],
        adata.obs['n_genes_by_counts'],
        c=adata.obs['pct_counts_mt'],
        cmap='viridis',
        alpha=0.5,
        s=10
    )
    axes[1, 0].set_xlabel('Total counts')
    axes[1, 0].set_ylabel('Genes detected')
    axes[1, 0].set_title('Counts vs Genes (colored by MT%)')
    plt.colorbar(scatter1, ax=axes[1, 0], label='MT %')

    axes[1, 1].scatter(
        adata.obs['total_counts'],
        adata.obs['pct_counts_mt'],
        alpha=0.5,
        s=10,
        color='coral'
    )
    axes[1, 1].set_xlabel('Total counts')
    axes[1, 1].set_ylabel('Mitochondrial %')
    axes[1, 1].set_title('Total Counts vs Mitochondrial %')

    axes[1, 2].scatter(
        adata.obs['n_genes_by_counts'],
        adata.obs['pct_counts_mt'],
        alpha=0.5,
        s=10,
        color='forestgreen'
    )
    axes[1, 2].set_xlabel('Genes detected')
    axes[1, 2].set_ylabel('Mitochondrial %')
    axes[1, 2].set_title('Genes vs Mitochondrial %')

    plt.tight_layout()
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    plt.close()

```

### references/scverse_qc_guidelines.md

```markdown
# scverse Quality Control Guidelines

This document provides detailed information about quality control best practices for single-cell RNA-seq data, following the scverse ecosystem recommendations.

## Quality Control Metrics

### Count Depth (Total Counts)
- **What it measures**: Total number of UMI/reads per cell
- **Why it matters**: Low count cells may be empty droplets, debris, or poorly captured cells
- **Typical range**: 500-50,000 counts per cell (varies by protocol)
- **Red flags**: Bimodal distributions may indicate mixing of high and low-quality cells

### Gene Detection (Genes per Cell)
- **What it measures**: Number of genes with at least 1 count
- **Why it matters**: Strongly correlates with count depth; low values indicate poor capture
- **Typical range**: 200-5,000 genes per cell
- **Red flags**: Very low values (<200) suggest technical failures

### Mitochondrial Content
- **What it measures**: Percentage of counts from mitochondrial genes
- **Why it matters**: High MT% indicates cell stress, apoptosis, or lysed cells
- **Typical range**: <5% for most tissues, up to 10-15% for metabolically active cells
- **Species-specific patterns**:
  - Mouse: Genes start with 'mt-' (e.g., mt-Nd1, mt-Co1)
  - Human: Genes start with 'MT-' (e.g., MT-ND1, MT-CO1)
- **Context matters**: Some cell types (cardiomyocytes, neurons) naturally have higher MT content

### Ribosomal Content
- **What it measures**: Percentage of counts from ribosomal protein genes
- **Why it matters**: Can indicate cell state or contamination
- **Patterns**: Genes start with 'Rpl'/'RPL' (large subunit) or 'Rps'/'RPS' (small subunit)
- **Note**: High ribosomal content isn't always bad - metabolically active cells have more ribosomes

### Hemoglobin Content
- **What it measures**: Percentage of counts from hemoglobin genes
- **Why it matters**: Indicates blood contamination in non-blood tissues
- **Patterns**: Genes matching '^Hb[^(p)]' or '^HB[^(P)]' (excludes Hbp1/HBP1)
- **When to use**: Particularly important for tissue samples (brain, liver, etc.)

## MAD-Based Filtering Rationale

### Why MAD Instead of Fixed Thresholds?

Fixed thresholds (e.g., "remove cells with <500 genes") fail because:
- Different protocols yield different ranges
- Different tissues have different characteristics
- Different species have different gene counts
- Fixed thresholds are arbitrary and not data-driven

MAD (Median Absolute Deviation) is robust to outliers and adapts to your dataset:
```
MAD = median(|X - median(X)|)
Outlier bounds = median ± n_MADs × MAD
```

### Recommended MAD Thresholds

Following scverse best practices (deliberately permissive):

**5 MADs for count depth (log-transformed)**
- Very permissive to retain rare cell populations
- Catches extreme outliers (empty droplets, debris)
- Log transformation handles the typical right-skewed distribution

**5 MADs for gene counts (log-transformed)**
- Parallels count depth filtering
- Most informative when combined with count filtering
- Log transformation normalizes the distribution

**3 MADs for mitochondrial percentage**
- More stringent because high MT% strongly indicates dying cells
- Uses raw percentages (not log-transformed)
- Combined with hard threshold for extra stringency

**Hard threshold: 8% mitochondrial content**
- Additional filter beyond MAD-based detection
- Conservative cutoff that works across most tissues
- Adjust higher (10-15%) for metabolically active cell types

### Why Be Permissive?

The default thresholds intentionally err on the side of keeping cells because:
1. **Rare populations**: Stringent filtering may remove rare but viable cell types
2. **Biological variation**: Some healthy cells naturally have extreme values
3. **Reversibility**: Easier to filter more later than to recover lost cells
4. **Downstream robustness**: Modern normalization methods handle moderate quality variation

## Interpreting QC Visualizations

### Histograms
- **Bimodal distributions**: May indicate mixing of cell types or quality issues
- **Long tails**: Common for count depth; MAD filtering handles this
- **Sharp cutoffs**: May indicate prior filtering or technical artifacts

### Violin Plots
- Shows distribution shape and density
- Median (line) and mean (diamond) should be similar for symmetric distributions
- Wide distributions suggest high heterogeneity

### Scatter Plots

**Counts vs Genes (colored by MT%)**
- Should show strong positive correlation (R² > 0.8 typical)
- Points deviating from trend may be outliers
- High MT% cells often cluster at low counts/genes

**Counts vs MT%**
- Negative correlation expected (dying cells have fewer counts)
- Vertical stratification may indicate batch effects
- Cells with high counts + high MT% are suspicious

**Genes vs MT%**
- Similar to counts vs MT%, but often weaker correlation
- Useful for identifying cells with gene detection issues

## Gene Filtering

After filtering cells, remove genes detected in fewer than 20 cells:
- **Why 20?**: Balances noise reduction with information retention
- **Benefits**: Reduces dataset size, speeds up computation, removes noisy genes
- **Trade-offs**: May lose very rare markers; adjust to 10 if studying rare populations

## Species-Specific Considerations

### Mouse (Mus musculus)
- Mitochondrial genes: mt-* (lowercase)
- Ribosomal genes: Rpl*, Rps* (capitalized first letter)
- Hemoglobin genes: Hb* (but not Hbp1)

### Human (Homo sapiens)
- Mitochondrial genes: MT-* (uppercase)
- Ribosomal genes: RPL*, RPS* (all uppercase)
- Hemoglobin genes: HB* (but not HBP1)

### Other Species
Adjust gene name patterns in the script to match your organism's gene nomenclature. Consult Ensembl or your reference annotation for correct prefixes.

## When to Adjust Parameters

Consider adjusting filtering thresholds when:

**More stringent (lower MADs)**
- High ambient RNA contamination suspected
- Many low-quality cells observed in visualizations
- Downstream analysis shows quality-driven clustering

**More permissive (higher MADs)**
- Studying rare cell populations
- Dataset has high technical quality
- Cell types naturally have extreme values (e.g., neurons with high MT%)

**Tissue-specific adjustments**
- Brain/neurons: May need higher MT% threshold (10-15%)
- Blood: Can be more stringent with MT% (5-8%)
- Tumor samples: Often need more permissive thresholds due to biological variation

## Advanced QC Considerations

### Not Included in This Workflow

**Ambient RNA correction**
- Tool: SoupX, CellBender, DecontX
- When: High background RNA in droplet-based data
- Effect: Removes contamination from lysed cells

**Doublet detection**
- Tool: scDblFinder, scrublet, DoubletFinder
- When: Always recommended for droplet-based data
- Effect: Identifies and removes multiplets (2+ cells in one droplet)

**Cell cycle scoring**
- Tool: scanpy's score_genes_cell_cycle
- When: Cell cycle effects confound biological signal
- Effect: Allows regressing out or accounting for cell cycle phase

**Batch correction**
- Tool: Harmony, scVI, ComBat
- When: Integrating data from multiple batches/experiments
- Effect: Removes technical batch effects while preserving biology

## References

- scverse Best Practices: https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
- Luecken & Theis (2019): Current best practices in single-cell RNA-seq analysis
- Osorio & Cai (2021): Systematic determination of the mitochondrial proportion in human and mouse genomes
- Germain et al. (2020): Doublet identification in single-cell sequencing data using scDblFinder

```

single-cell-rna-qc | SkillHub