single-cell-rna-qc
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install biocontext-ai-skill-to-mcp-single-cell-rna-qc
Repository
Skill path: skills/single-cell-rna-qc
Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
Open repositoryBest for
Primary workflow: Analyze Data & AI.
Technical facets: Full Stack, Data / AI.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: biocontext-ai.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install single-cell-rna-qc into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/biocontext-ai/skill-to-mcp before adding single-cell-rna-qc to shared team environments
- Use single-cell-rna-qc for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: single-cell-rna-qc
description: Performs quality control on single-cell RNA-seq data (.h5ad or .h5 files) using scverse best practices with MAD-based filtering and comprehensive visualizations. Use when users request QC analysis, filtering low-quality cells, assessing data quality, or following scverse/scanpy best practices for single-cell analysis.
---
# Single-Cell RNA-seq Quality Control
Automated QC workflow for single-cell RNA-seq data following scverse best practices.
> **Acknowledgment**: This skill is adapted from [Anthropic's Life Sciences repository](https://github.com/anthropics/life-sciences) and follows best practices from the [scverse® community](https://scverse.org).
## When to Use This Skill
Use when users:
- Request quality control or QC on single-cell RNA-seq data
- Want to filter low-quality cells or assess data quality
- Need QC visualizations or metrics
- Ask to follow scverse/scanpy best practices
- Request MAD-based filtering or outlier detection
**Supported input formats:**
- `.h5ad` files (AnnData format from scanpy/Python workflows)
- `.h5` files (10X Genomics Cell Ranger output)
**Default recommendation**: Use Approach 1 (complete pipeline) unless the user has specific custom requirements or explicitly requests non-standard filtering logic.
## Approach 1: Complete QC Pipeline (Recommended for Standard Workflows)
For standard QC following scverse best practices, use the convenience script `scripts/qc_analysis.py`:
```bash
python3 scripts/qc_analysis.py input.h5ad
# or for 10X Genomics .h5 files:
python3 scripts/qc_analysis.py raw_feature_bc_matrix.h5
```
The script automatically detects the file format and loads it appropriately.
**When to use this approach:**
- Standard QC workflow with adjustable thresholds (all cells filtered the same way)
- Batch processing multiple datasets
- Quick exploratory analysis
- User wants the "just works" solution
**Requirements:** anndata, scanpy, scipy, matplotlib, seaborn, numpy
**Parameters:**
Customize filtering thresholds and gene patterns using command-line parameters:
- `--output-dir` - Output directory
- `--mad-counts`, `--mad-genes`, `--mad-mt` - MAD thresholds for counts/genes/MT%
- `--mt-threshold` - Hard mitochondrial % cutoff
- `--min-cells` - Gene filtering threshold
- `--mt-pattern`, `--ribo-pattern`, `--hb-pattern` - Gene name patterns for different species
Use `--help` to see current default values.
**Outputs:**
All files are saved to `<input_basename>_qc_results/` directory by default (or to the directory specified by `--output-dir`):
- `qc_metrics_before_filtering.png` - Pre-filtering visualizations
- `qc_filtering_thresholds.png` - MAD-based threshold overlays
- `qc_metrics_after_filtering.png` - Post-filtering quality metrics
- `<input_basename>_filtered.h5ad` - Clean, filtered dataset ready for downstream analysis
- `<input_basename>_with_qc.h5ad` - Original data with QC annotations preserved
If copying outputs to `/mnt/user-data/outputs/` for user access, copy individual files (not the entire directory) so users can preview them directly as Claude.ai artifacts.
### Workflow Steps
The script performs the following steps:
1. **Calculate QC metrics** - Count depth, gene detection, mitochondrial/ribosomal/hemoglobin content
2. **Apply MAD-based filtering** - Permissive outlier detection using MAD thresholds for counts/genes/MT%
3. **Filter genes** - Remove genes detected in few cells
4. **Generate visualizations** - Comprehensive before/after plots with threshold overlays
## Approach 2: Modular Building Blocks (For Custom Workflows)
For custom analysis workflows or non-standard requirements, use the modular utility functions from `scripts/qc_core.py` and `scripts/qc_plotting.py`:
```python
# Run from scripts/ directory, or add scripts/ to sys.path if needed
import anndata as ad
from qc_core import calculate_qc_metrics, detect_outliers_mad, filter_cells
from qc_plotting import plot_qc_distributions # Only if visualization needed
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# ... custom analysis logic here
```
**When to use this approach:**
- Different workflow needed (skip steps, change order, apply different thresholds to subsets)
- Conditional logic (e.g., filter neurons differently than other cells)
- Partial execution (only metrics/visualization, no filtering)
- Integration with other analysis steps in a larger pipeline
- Custom filtering criteria beyond what command-line params support
**Available utility functions:**
From `qc_core.py` (core QC operations):
- `calculate_qc_metrics(adata, mt_pattern, ribo_pattern, hb_pattern, inplace=True)` - Calculate QC metrics and annotate adata
- `detect_outliers_mad(adata, metric, n_mads, verbose=True)` - MAD-based outlier detection, returns boolean mask
- `apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True)` - Apply hard cutoffs, returns boolean mask
- `filter_cells(adata, mask, inplace=False)` - Apply boolean mask to filter cells
- `filter_genes(adata, min_cells=20, min_counts=None, inplace=True)` - Filter genes by detection
- `print_qc_summary(adata, label='')` - Print summary statistics
From `qc_plotting.py` (visualization):
- `plot_qc_distributions(adata, output_path, title)` - Generate comprehensive QC plots
- `plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path)` - Visualize filtering thresholds
- `plot_qc_after_filtering(adata, output_path)` - Generate post-filtering plots
**Example custom workflows:**
**Example 1: Only calculate metrics and visualize, don't filter yet**
```python
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
plot_qc_distributions(adata, 'qc_before.png', title='Initial QC')
print_qc_summary(adata, label='Before filtering')
```
**Example 2: Apply only MT% filtering, keep other metrics permissive**
```python
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# Only filter high MT% cells
high_mt = apply_hard_threshold(adata, 'pct_counts_mt', 10, operator='>')
adata_filtered = filter_cells(adata, ~high_mt)
adata_filtered.write('filtered.h5ad')
```
**Example 3: Different thresholds for different subsets**
```python
adata = ad.read_h5ad('input.h5ad')
calculate_qc_metrics(adata, inplace=True)
# Apply type-specific QC (assumes cell_type metadata exists)
neurons = adata.obs['cell_type'] == 'neuron'
other_cells = ~neurons
# Neurons tolerate higher MT%, other cells use stricter threshold
neuron_qc = apply_hard_threshold(adata[neurons], 'pct_counts_mt', 15, operator='>')
other_qc = apply_hard_threshold(adata[other_cells], 'pct_counts_mt', 8, operator='>')
```
## Best Practices
1. **Be permissive with filtering** - Default thresholds intentionally retain most cells to avoid losing rare populations
2. **Inspect visualizations** - Always review before/after plots to ensure filtering makes biological sense
3. **Consider dataset-specific factors** - Some tissues naturally have higher mitochondrial content (e.g., neurons, cardiomyocytes)
4. **Check gene annotations** - Mitochondrial gene prefixes vary by species (mt- for mouse, MT- for human)
5. **Iterate if needed** - QC parameters may need adjustment based on the specific experiment or tissue type
## Reference Materials
For detailed QC methodology, parameter rationale, and troubleshooting guidance, see `references/scverse_qc_guidelines.md`. This reference provides:
- Detailed explanations of each QC metric and why it matters
- Rationale for MAD-based thresholds and why they're better than fixed cutoffs
- Guidelines for interpreting QC visualizations (histograms, violin plots, scatter plots)
- Species-specific considerations for gene annotations
- When and how to adjust filtering parameters
- Advanced QC considerations (ambient RNA correction, doublet detection)
Load this reference when users need deeper understanding of the methodology or when troubleshooting QC issues.
## Next Steps After QC
Typical downstream analysis steps:
- Ambient RNA correction (SoupX, CellBender)
- Doublet detection (scDblFinder)
- Normalization (log-normalize, scran)
- Feature selection and dimensionality reduction
- Clustering and cell type annotation
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### scripts/qc_analysis.py
```python
#!/usr/bin/env python3
"""
Quality Control Analysis for Single-Cell RNA-seq Data
Following scverse best practices from:
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
This is a convenience script that runs a complete QC workflow using the
modular functions from qc_core.py and qc_plotting.py.
"""
import anndata as ad
import scanpy as sc
import sys
import os
import argparse
# Import our modular utilities
from qc_core import (
calculate_qc_metrics,
detect_outliers_mad,
apply_hard_threshold,
filter_cells,
filter_genes,
print_qc_summary
)
from qc_plotting import (
plot_qc_distributions,
plot_filtering_thresholds,
plot_qc_after_filtering
)
print("=" * 80)
print("Single-Cell RNA-seq Quality Control Analysis")
print("=" * 80)
# Default parameters (single source of truth)
DEFAULT_MAD_COUNTS = 5
DEFAULT_MAD_GENES = 5
DEFAULT_MAD_MT = 3
DEFAULT_MT_THRESHOLD = 8
DEFAULT_MIN_CELLS = 20
DEFAULT_MT_PATTERN = 'mt-,MT-'
DEFAULT_RIBO_PATTERN = 'Rpl,Rps,RPL,RPS'
DEFAULT_HB_PATTERN = '^Hb[^(p)]|^HB[^(P)]'
# Parse command-line arguments
parser = argparse.ArgumentParser(
description='Quality Control Analysis for Single-Cell RNA-seq Data',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python3 qc_analysis.py data.h5ad
python3 qc_analysis.py raw_feature_bc_matrix.h5
python3 qc_analysis.py data.h5ad --mad-counts 4 --mad-genes 4 --mad-mt 2.5
python3 qc_analysis.py data.h5ad --mt-threshold 10 --min-cells 10
python3 qc_analysis.py data.h5ad --mt-pattern "^mt-" --ribo-pattern "^Rpl,^Rps"
"""
)
parser.add_argument('input_file', help='Input .h5ad or .h5 file (10X Genomics format)')
parser.add_argument('--output-dir', type=str, help='Output directory (default: <input_basename>_qc_results)')
parser.add_argument('--mad-counts', type=float, default=DEFAULT_MAD_COUNTS, help=f'MAD threshold for total counts (default: {DEFAULT_MAD_COUNTS})')
parser.add_argument('--mad-genes', type=float, default=DEFAULT_MAD_GENES, help=f'MAD threshold for gene counts (default: {DEFAULT_MAD_GENES})')
parser.add_argument('--mad-mt', type=float, default=DEFAULT_MAD_MT, help=f'MAD threshold for mitochondrial percentage (default: {DEFAULT_MAD_MT})')
parser.add_argument('--mt-threshold', type=float, default=DEFAULT_MT_THRESHOLD, help=f'Hard threshold for mitochondrial percentage (default: {DEFAULT_MT_THRESHOLD})')
parser.add_argument('--min-cells', type=int, default=DEFAULT_MIN_CELLS, help=f'Minimum cells for gene filtering (default: {DEFAULT_MIN_CELLS})')
parser.add_argument('--mt-pattern', type=str, default=DEFAULT_MT_PATTERN, help=f'Comma-separated mitochondrial gene prefixes (default: "{DEFAULT_MT_PATTERN}")')
parser.add_argument('--ribo-pattern', type=str, default=DEFAULT_RIBO_PATTERN, help=f'Comma-separated ribosomal gene prefixes (default: "{DEFAULT_RIBO_PATTERN}")')
parser.add_argument('--hb-pattern', type=str, default=DEFAULT_HB_PATTERN, help=f'Hemoglobin gene regex pattern (default: "{DEFAULT_HB_PATTERN}")')
args = parser.parse_args()
# Verify input file exists
if not os.path.exists(args.input_file):
print(f"\nError: File '{args.input_file}' not found!")
sys.exit(1)
input_file = args.input_file
base_name = os.path.splitext(os.path.basename(input_file))[0]
# Set up output directory
if args.output_dir:
output_dir = args.output_dir
else:
output_dir = f"{base_name}_qc_results"
os.makedirs(output_dir, exist_ok=True)
print(f"\nOutput directory: {output_dir}")
# Display parameters
print(f"\nParameters:")
print(f" MAD thresholds: counts={args.mad_counts}, genes={args.mad_genes}, MT%={args.mad_mt}")
print(f" MT hard threshold: {args.mt_threshold}%")
print(f" Min cells for gene filtering: {args.min_cells}")
print(f" Gene patterns: MT={args.mt_pattern}, Ribo={args.ribo_pattern}")
# Load the data
print("\n[1/5] Loading data...")
file_ext = os.path.splitext(input_file)[1].lower()
if file_ext == '.h5ad':
adata = ad.read_h5ad(input_file)
print(f"Loaded .h5ad file: {adata.n_obs} cells × {adata.n_vars} genes")
elif file_ext == '.h5':
adata = sc.read_10x_h5(input_file)
print(f"Loaded 10X .h5 file: {adata.n_obs} cells × {adata.n_vars} genes")
# Make variable names unique (10X data sometimes has duplicate gene names)
adata.var_names_make_unique()
else:
print(f"\nError: Unsupported file format '{file_ext}'. Expected .h5ad or .h5")
sys.exit(1)
# Store original counts for comparison
n_cells_original = adata.n_obs
n_genes_original = adata.n_vars
# Calculate QC metrics
print("\n[2/5] Calculating QC metrics...")
calculate_qc_metrics(adata, mt_pattern=args.mt_pattern,
ribo_pattern=args.ribo_pattern,
hb_pattern=args.hb_pattern,
inplace=True)
print(f" Found {adata.var['mt'].sum()} mitochondrial genes (pattern: {args.mt_pattern})")
print(f" Found {adata.var['ribo'].sum()} ribosomal genes (pattern: {args.ribo_pattern})")
print(f" Found {adata.var['hb'].sum()} hemoglobin genes (pattern: {args.hb_pattern})")
print_qc_summary(adata, label='QC Metrics Summary (before filtering)')
# Create before-filtering visualizations
print("\n[3/5] Creating QC visualizations...")
before_plot = os.path.join(output_dir, 'qc_metrics_before_filtering.png')
plot_qc_distributions(adata, before_plot, title='Quality Control Metrics - Before Filtering')
print(f" Saved: {before_plot}")
# Apply MAD-based filtering
print("\n[4/5] Applying MAD-based filtering thresholds...")
# Detect outliers for each metric
adata.obs['outlier_counts'] = detect_outliers_mad(adata, 'total_counts', args.mad_counts)
adata.obs['outlier_genes'] = detect_outliers_mad(adata, 'n_genes_by_counts', args.mad_genes)
adata.obs['outlier_mt'] = detect_outliers_mad(adata, 'pct_counts_mt', args.mad_mt)
# Apply hard threshold for mitochondrial content
print(f"\n Applying hard threshold for mitochondrial content (>{args.mt_threshold}%):")
high_mt_mask = apply_hard_threshold(adata, 'pct_counts_mt', args.mt_threshold, operator='>')
# Combine MT filters (MAD + hard threshold)
adata.obs['outlier_mt'] = adata.obs['outlier_mt'] | high_mt_mask
# Overall filtering decision
adata.obs['pass_qc'] = ~(
adata.obs['outlier_counts'] |
adata.obs['outlier_genes'] |
adata.obs['outlier_mt']
)
print(f"\n Total cells failing QC: {(~adata.obs['pass_qc']).sum()} ({(~adata.obs['pass_qc']).sum()/adata.n_obs*100:.2f}%)")
print(f" Cells passing QC: {adata.obs['pass_qc'].sum()} ({adata.obs['pass_qc'].sum()/adata.n_obs*100:.2f}%)")
# Visualize filtering thresholds
outlier_masks = {
'total_counts': adata.obs['outlier_counts'].values,
'n_genes_by_counts': adata.obs['outlier_genes'].values,
'pct_counts_mt': adata.obs['outlier_mt'].values
}
thresholds = {
'total_counts': {'n_mads': args.mad_counts},
'n_genes_by_counts': {'n_mads': args.mad_genes},
'pct_counts_mt': {'n_mads': args.mad_mt, 'hard': args.mt_threshold}
}
threshold_plot = os.path.join(output_dir, 'qc_filtering_thresholds.png')
plot_filtering_thresholds(adata, outlier_masks, thresholds, threshold_plot)
print(f"\n Saved: {threshold_plot}")
# Apply filtering
print("\n[5/5] Applying filters...")
adata_filtered = filter_cells(adata, adata.obs['pass_qc'].values, inplace=False)
print(f" Cells after filtering: {adata_filtered.n_obs} (removed {n_cells_original - adata_filtered.n_obs})")
# Filter genes
print(f"\n Filtering genes detected in <{args.min_cells} cells...")
filter_genes(adata_filtered, min_cells=args.min_cells, inplace=True)
print(f" Genes after filtering: {adata_filtered.n_vars} (removed {n_genes_original - adata_filtered.n_vars})")
# Generate summary statistics
print("\n" + "=" * 80)
print("QC Summary")
print("=" * 80)
print("\nBefore filtering:")
print(f" Cells: {n_cells_original}")
print(f" Genes: {n_genes_original}")
print("\nAfter filtering:")
print(f" Cells: {adata_filtered.n_obs} ({adata_filtered.n_obs/n_cells_original*100:.1f}% retained)")
print(f" Genes: {adata_filtered.n_vars} ({adata_filtered.n_vars/n_genes_original*100:.1f}% retained)")
print_qc_summary(adata_filtered, label='\nFiltered data QC metrics')
# Create after-filtering visualizations
after_plot = os.path.join(output_dir, 'qc_metrics_after_filtering.png')
plot_qc_after_filtering(adata_filtered, after_plot)
print(f"\n Saved: {after_plot}")
# Save filtered data
print("\nSaving filtered data...")
output_filtered = os.path.join(output_dir, f'{base_name}_filtered.h5ad')
output_with_qc = os.path.join(output_dir, f'{base_name}_with_qc.h5ad')
adata_filtered.write(output_filtered)
print(f" Saved: {output_filtered}")
# Also save the unfiltered data with QC annotations
adata.write(output_with_qc)
print(f" Saved: {output_with_qc} (original data with QC annotations)")
print("\n" + "=" * 80)
print("Quality Control Analysis Complete!")
print("=" * 80)
print(f"\nAll results saved to: {output_dir}/")
print("\nGenerated files:")
print(" 1. qc_metrics_before_filtering.png - Initial QC visualizations")
print(" 2. qc_filtering_thresholds.png - MAD-based threshold visualization")
print(" 3. qc_metrics_after_filtering.png - Post-filtering QC visualizations")
print(f" 4. {base_name}_filtered.h5ad - Filtered dataset")
print(f" 5. {base_name}_with_qc.h5ad - Original dataset with QC annotations")
print("\nNext steps:")
print(" - Consider ambient RNA correction (SoupX)")
print(" - Consider doublet detection (scDblFinder)")
print(" - Proceed with normalization and downstream analysis")
```
### scripts/qc_core.py
```python
#!/usr/bin/env python3
"""
Core utility functions for single-cell RNA-seq quality control.
This module provides building blocks for metrics calculation and filtering
while following scverse best practices from:
https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
"""
import anndata as ad
import scanpy as sc
import numpy as np
from scipy.stats import median_abs_deviation
def calculate_qc_metrics(adata, mt_pattern='mt-,MT-', ribo_pattern='Rpl,Rps,RPL,RPS',
hb_pattern='^Hb[^(p)]|^HB[^(P)]', inplace=True):
"""
Calculate QC metrics for single-cell RNA-seq data.
Parameters
----------
adata : AnnData
Annotated data matrix
mt_pattern : str
Comma-separated mitochondrial gene prefixes (default: 'mt-,MT-')
ribo_pattern : str
Comma-separated ribosomal gene prefixes (default: 'Rpl,Rps,RPL,RPS')
hb_pattern : str
Regex pattern for hemoglobin genes (default: '^Hb[^(p)]|^HB[^(P)]')
inplace : bool
Modify adata in place (default: True)
Returns
-------
AnnData or None
If inplace=False, returns modified AnnData. Otherwise modifies in place.
"""
if not inplace:
adata = adata.copy()
# Identify gene categories
mt_prefixes = tuple(mt_pattern.split(','))
adata.var['mt'] = adata.var_names.str.startswith(mt_prefixes)
ribo_prefixes = tuple(ribo_pattern.split(','))
adata.var['ribo'] = adata.var_names.str.startswith(ribo_prefixes)
adata.var['hb'] = adata.var_names.str.match(hb_pattern)
# Calculate QC metrics
sc.pp.calculate_qc_metrics(
adata,
qc_vars=['mt', 'ribo', 'hb'],
percent_top=None,
log1p=False,
inplace=True
)
if not inplace:
return adata
def detect_outliers_mad(adata, metric, n_mads, verbose=True):
"""
Detect outliers using Median Absolute Deviation (MAD).
Parameters
----------
adata : AnnData
Annotated data matrix with QC metrics
metric : str
Column name in adata.obs to use for outlier detection
n_mads : float
Number of MADs to use as threshold
verbose : bool
Print outlier statistics (default: True)
Returns
-------
np.ndarray
Boolean mask where True indicates outliers
"""
metric_values = adata.obs[metric]
median = np.median(metric_values)
mad = median_abs_deviation(metric_values)
# Calculate bounds
lower = median - n_mads * mad
upper = median + n_mads * mad
# Identify outliers
outlier_mask = (metric_values < lower) | (metric_values > upper)
if verbose:
print(f" {metric}:")
print(f" Median: {median:.2f}, MAD: {mad:.2f}")
print(f" Bounds: [{lower:.2f}, {upper:.2f}] ({n_mads} MADs)")
print(f" Outliers: {outlier_mask.sum()} cells ({outlier_mask.sum()/len(metric_values)*100:.2f}%)")
return outlier_mask
def apply_hard_threshold(adata, metric, threshold, operator='>', verbose=True):
"""
Apply a hard threshold filter.
Parameters
----------
adata : AnnData
Annotated data matrix
metric : str
Column name in adata.obs to filter on
threshold : float
Threshold value
operator : str
Comparison operator: '>', '<', '>=', '<=' (default: '>')
verbose : bool
Print filtering statistics (default: True)
Returns
-------
np.ndarray
Boolean mask where True indicates cells to filter out
"""
metric_values = adata.obs[metric]
if operator == '>':
mask = metric_values > threshold
elif operator == '<':
mask = metric_values < threshold
elif operator == '>=':
mask = metric_values >= threshold
elif operator == '<=':
mask = metric_values <= threshold
else:
raise ValueError(f"Invalid operator: {operator}. Use '>', '<', '>=', or '<='")
if verbose:
print(f" {metric} {operator} {threshold}:")
print(f" Cells filtered: {mask.sum()} ({mask.sum()/len(metric_values)*100:.2f}%)")
return mask
def filter_cells(adata, mask, inplace=False):
"""
Filter cells based on a boolean mask.
Parameters
----------
adata : AnnData
Annotated data matrix
mask : np.ndarray or pd.Series
Boolean mask where True indicates cells to KEEP
inplace : bool
Modify adata in place (default: False)
Returns
-------
AnnData
Filtered AnnData object
"""
if inplace:
# This is actually a bit tricky - AnnData doesn't support true inplace filtering
# Return filtered copy which caller should reassign
return adata[mask].copy()
else:
return adata[mask].copy()
def filter_genes(adata, min_cells=20, min_counts=None, inplace=True):
"""
Filter genes based on detection thresholds.
Parameters
----------
adata : AnnData
Annotated data matrix
min_cells : int
Minimum number of cells a gene must be detected in (default: 20)
min_counts : int, optional
Minimum total counts across all cells
inplace : bool
Modify adata in place (default: True)
Returns
-------
AnnData or None
If inplace=False, returns filtered AnnData
"""
if not inplace:
adata = adata.copy()
if min_cells is not None:
sc.pp.filter_genes(adata, min_cells=min_cells)
if min_counts is not None:
sc.pp.filter_genes(adata, min_counts=min_counts)
if not inplace:
return adata
def print_qc_summary(adata, label=''):
"""
Print summary statistics for QC metrics.
Parameters
----------
adata : AnnData
Annotated data matrix with QC metrics
label : str
Label to prepend to output (e.g., 'Before filtering', 'After filtering')
"""
if label:
print(f"\n{label}:")
print(f" Cells: {adata.n_obs}")
print(f" Genes: {adata.n_vars}")
if 'total_counts' in adata.obs:
print(f" Mean counts per cell: {adata.obs['total_counts'].mean():.0f}")
print(f" Median counts per cell: {adata.obs['total_counts'].median():.0f}")
if 'n_genes_by_counts' in adata.obs:
print(f" Mean genes per cell: {adata.obs['n_genes_by_counts'].mean():.0f}")
print(f" Median genes per cell: {adata.obs['n_genes_by_counts'].median():.0f}")
if 'pct_counts_mt' in adata.obs:
print(f" Mean mitochondrial %: {adata.obs['pct_counts_mt'].mean():.2f}%")
if 'pct_counts_ribo' in adata.obs:
print(f" Mean ribosomal %: {adata.obs['pct_counts_ribo'].mean():.2f}%")
```
### scripts/qc_plotting.py
```python
#!/usr/bin/env python3
"""
Visualization functions for single-cell RNA-seq quality control.
This module provides plotting utilities for QC metrics and filtering thresholds.
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import median_abs_deviation
def plot_qc_distributions(adata, output_path, title='Quality Control Metrics'):
"""
Create comprehensive QC distribution plots.
Parameters
----------
adata : AnnData
Annotated data matrix with QC metrics
output_path : str
Path to save the figure
title : str
Figure title (default: 'Quality Control Metrics')
"""
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle(title, fontsize=16, y=0.995)
# Row 1: Histograms
axes[0, 0].hist(adata.obs['total_counts'], bins=100, color='steelblue', edgecolor='black')
axes[0, 0].set_xlabel('Total counts per cell')
axes[0, 0].set_ylabel('Number of cells')
axes[0, 0].set_title('Distribution of Total Counts')
axes[0, 0].axvline(adata.obs['total_counts'].median(), color='red', linestyle='--', label='Median')
axes[0, 0].legend()
axes[0, 1].hist(adata.obs['n_genes_by_counts'], bins=100, color='forestgreen', edgecolor='black')
axes[0, 1].set_xlabel('Genes per cell')
axes[0, 1].set_ylabel('Number of cells')
axes[0, 1].set_title('Distribution of Detected Genes')
axes[0, 1].axvline(adata.obs['n_genes_by_counts'].median(), color='red', linestyle='--', label='Median')
axes[0, 1].legend()
axes[0, 2].hist(adata.obs['pct_counts_mt'], bins=100, color='coral', edgecolor='black')
axes[0, 2].set_xlabel('Mitochondrial %')
axes[0, 2].set_ylabel('Number of cells')
axes[0, 2].set_title('Distribution of Mitochondrial Content')
axes[0, 2].axvline(adata.obs['pct_counts_mt'].median(), color='red', linestyle='--', label='Median')
axes[0, 2].legend()
# Row 2: Violin plots
axes[1, 0].violinplot([adata.obs['total_counts']], positions=[0], showmeans=True, showmedians=True)
axes[1, 0].set_ylabel('Total counts')
axes[1, 0].set_title('Total Counts per Cell')
axes[1, 0].set_xticks([])
axes[1, 1].violinplot([adata.obs['n_genes_by_counts']], positions=[0], showmeans=True, showmedians=True)
axes[1, 1].set_ylabel('Genes detected')
axes[1, 1].set_title('Genes per Cell')
axes[1, 1].set_xticks([])
axes[1, 2].violinplot([adata.obs['pct_counts_mt']], positions=[0], showmeans=True, showmedians=True)
axes[1, 2].set_ylabel('Mitochondrial %')
axes[1, 2].set_title('Mitochondrial Content')
axes[1, 2].set_xticks([])
# Row 3: Scatter plots
scatter1 = axes[2, 0].scatter(
adata.obs['total_counts'],
adata.obs['n_genes_by_counts'],
c=adata.obs['pct_counts_mt'],
cmap='viridis',
alpha=0.5,
s=10
)
axes[2, 0].set_xlabel('Total counts')
axes[2, 0].set_ylabel('Genes detected')
axes[2, 0].set_title('Counts vs Genes (colored by MT%)')
plt.colorbar(scatter1, ax=axes[2, 0], label='MT %')
axes[2, 1].scatter(
adata.obs['total_counts'],
adata.obs['pct_counts_mt'],
alpha=0.5,
s=10,
color='coral'
)
axes[2, 1].set_xlabel('Total counts')
axes[2, 1].set_ylabel('Mitochondrial %')
axes[2, 1].set_title('Total Counts vs Mitochondrial %')
axes[2, 2].scatter(
adata.obs['n_genes_by_counts'],
adata.obs['pct_counts_mt'],
alpha=0.5,
s=10,
color='forestgreen'
)
axes[2, 2].set_xlabel('Genes detected')
axes[2, 2].set_ylabel('Mitochondrial %')
axes[2, 2].set_title('Genes vs Mitochondrial %')
plt.tight_layout()
plt.savefig(output_path, dpi=300, bbox_inches='tight')
plt.close()
def plot_filtering_thresholds(adata, outlier_masks, thresholds, output_path):
"""
Visualize filtering thresholds overlaid on distributions.
Parameters
----------
adata : AnnData
Annotated data matrix with QC metrics
outlier_masks : dict
Dictionary mapping metric names to boolean outlier masks
Example: {'total_counts': mask1, 'n_genes_by_counts': mask2, 'pct_counts_mt': mask3}
thresholds : dict
Dictionary with threshold information for each metric
Example: {'total_counts': {'n_mads': 5}, 'pct_counts_mt': {'n_mads': 3, 'hard': 8}}
output_path : str
Path to save the figure
"""
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
fig.suptitle('MAD-Based Filtering Thresholds', fontsize=16)
# Helper function to plot with thresholds
def plot_with_threshold(ax, metric, outlier_mask, n_mads, hard_threshold=None):
data = adata.obs[metric]
median = np.median(data)
mad = median_abs_deviation(data)
lower = median - n_mads * mad
upper = median + n_mads * mad
ax.hist(data[~outlier_mask], bins=100, alpha=0.7, label='Pass QC', color='steelblue')
ax.hist(data[outlier_mask], bins=100, alpha=0.7, label='Fail QC', color='coral')
ax.axvline(lower, color='red', linestyle='--', linewidth=2, label=f'Thresholds ({n_mads} MADs)')
ax.axvline(upper, color='red', linestyle='--', linewidth=2)
if hard_threshold is not None:
ax.axvline(hard_threshold, color='darkred', linestyle=':', linewidth=2,
label=f'Hard threshold ({hard_threshold})')
ax.set_xlabel(metric.replace('_', ' ').title())
ax.set_ylabel('Number of cells')
ax.legend()
# Plot each metric
metrics = [
('total_counts', 'Total Counts'),
('n_genes_by_counts', 'Genes Detected'),
('pct_counts_mt', 'Mitochondrial %')
]
for idx, (metric, label) in enumerate(metrics):
if metric in outlier_masks and metric in thresholds:
hard = thresholds[metric].get('hard', None)
plot_with_threshold(axes[idx], metric, outlier_masks[metric],
thresholds[metric]['n_mads'], hard)
plt.tight_layout()
plt.savefig(output_path, dpi=300, bbox_inches='tight')
plt.close()
def plot_qc_after_filtering(adata, output_path):
"""
Create QC plots for filtered data (simplified version without outlier overlay).
Parameters
----------
adata : AnnData
Filtered annotated data matrix with QC metrics
output_path : str
Path to save the figure
"""
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle('Quality Control Metrics - After Filtering', fontsize=16, y=0.995)
# Row 1: Histograms
axes[0, 0].hist(adata.obs['total_counts'], bins=100, color='steelblue', edgecolor='black')
axes[0, 0].set_xlabel('Total counts per cell')
axes[0, 0].set_ylabel('Number of cells')
axes[0, 0].set_title('Distribution of Total Counts')
axes[0, 1].hist(adata.obs['n_genes_by_counts'], bins=100, color='forestgreen', edgecolor='black')
axes[0, 1].set_xlabel('Genes per cell')
axes[0, 1].set_ylabel('Number of cells')
axes[0, 1].set_title('Distribution of Detected Genes')
axes[0, 2].hist(adata.obs['pct_counts_mt'], bins=100, color='coral', edgecolor='black')
axes[0, 2].set_xlabel('Mitochondrial %')
axes[0, 2].set_ylabel('Number of cells')
axes[0, 2].set_title('Distribution of Mitochondrial Content')
# Row 2: Scatter plots
scatter1 = axes[1, 0].scatter(
adata.obs['total_counts'],
adata.obs['n_genes_by_counts'],
c=adata.obs['pct_counts_mt'],
cmap='viridis',
alpha=0.5,
s=10
)
axes[1, 0].set_xlabel('Total counts')
axes[1, 0].set_ylabel('Genes detected')
axes[1, 0].set_title('Counts vs Genes (colored by MT%)')
plt.colorbar(scatter1, ax=axes[1, 0], label='MT %')
axes[1, 1].scatter(
adata.obs['total_counts'],
adata.obs['pct_counts_mt'],
alpha=0.5,
s=10,
color='coral'
)
axes[1, 1].set_xlabel('Total counts')
axes[1, 1].set_ylabel('Mitochondrial %')
axes[1, 1].set_title('Total Counts vs Mitochondrial %')
axes[1, 2].scatter(
adata.obs['n_genes_by_counts'],
adata.obs['pct_counts_mt'],
alpha=0.5,
s=10,
color='forestgreen'
)
axes[1, 2].set_xlabel('Genes detected')
axes[1, 2].set_ylabel('Mitochondrial %')
axes[1, 2].set_title('Genes vs Mitochondrial %')
plt.tight_layout()
plt.savefig(output_path, dpi=300, bbox_inches='tight')
plt.close()
```
### references/scverse_qc_guidelines.md
```markdown
# scverse Quality Control Guidelines
This document provides detailed information about quality control best practices for single-cell RNA-seq data, following the scverse ecosystem recommendations.
## Quality Control Metrics
### Count Depth (Total Counts)
- **What it measures**: Total number of UMI/reads per cell
- **Why it matters**: Low count cells may be empty droplets, debris, or poorly captured cells
- **Typical range**: 500-50,000 counts per cell (varies by protocol)
- **Red flags**: Bimodal distributions may indicate mixing of high and low-quality cells
### Gene Detection (Genes per Cell)
- **What it measures**: Number of genes with at least 1 count
- **Why it matters**: Strongly correlates with count depth; low values indicate poor capture
- **Typical range**: 200-5,000 genes per cell
- **Red flags**: Very low values (<200) suggest technical failures
### Mitochondrial Content
- **What it measures**: Percentage of counts from mitochondrial genes
- **Why it matters**: High MT% indicates cell stress, apoptosis, or lysed cells
- **Typical range**: <5% for most tissues, up to 10-15% for metabolically active cells
- **Species-specific patterns**:
- Mouse: Genes start with 'mt-' (e.g., mt-Nd1, mt-Co1)
- Human: Genes start with 'MT-' (e.g., MT-ND1, MT-CO1)
- **Context matters**: Some cell types (cardiomyocytes, neurons) naturally have higher MT content
### Ribosomal Content
- **What it measures**: Percentage of counts from ribosomal protein genes
- **Why it matters**: Can indicate cell state or contamination
- **Patterns**: Genes start with 'Rpl'/'RPL' (large subunit) or 'Rps'/'RPS' (small subunit)
- **Note**: High ribosomal content isn't always bad - metabolically active cells have more ribosomes
### Hemoglobin Content
- **What it measures**: Percentage of counts from hemoglobin genes
- **Why it matters**: Indicates blood contamination in non-blood tissues
- **Patterns**: Genes matching '^Hb[^(p)]' or '^HB[^(P)]' (excludes Hbp1/HBP1)
- **When to use**: Particularly important for tissue samples (brain, liver, etc.)
## MAD-Based Filtering Rationale
### Why MAD Instead of Fixed Thresholds?
Fixed thresholds (e.g., "remove cells with <500 genes") fail because:
- Different protocols yield different ranges
- Different tissues have different characteristics
- Different species have different gene counts
- Fixed thresholds are arbitrary and not data-driven
MAD (Median Absolute Deviation) is robust to outliers and adapts to your dataset:
```
MAD = median(|X - median(X)|)
Outlier bounds = median ± n_MADs × MAD
```
### Recommended MAD Thresholds
Following scverse best practices (deliberately permissive):
**5 MADs for count depth (log-transformed)**
- Very permissive to retain rare cell populations
- Catches extreme outliers (empty droplets, debris)
- Log transformation handles the typical right-skewed distribution
**5 MADs for gene counts (log-transformed)**
- Parallels count depth filtering
- Most informative when combined with count filtering
- Log transformation normalizes the distribution
**3 MADs for mitochondrial percentage**
- More stringent because high MT% strongly indicates dying cells
- Uses raw percentages (not log-transformed)
- Combined with hard threshold for extra stringency
**Hard threshold: 8% mitochondrial content**
- Additional filter beyond MAD-based detection
- Conservative cutoff that works across most tissues
- Adjust higher (10-15%) for metabolically active cell types
### Why Be Permissive?
The default thresholds intentionally err on the side of keeping cells because:
1. **Rare populations**: Stringent filtering may remove rare but viable cell types
2. **Biological variation**: Some healthy cells naturally have extreme values
3. **Reversibility**: Easier to filter more later than to recover lost cells
4. **Downstream robustness**: Modern normalization methods handle moderate quality variation
## Interpreting QC Visualizations
### Histograms
- **Bimodal distributions**: May indicate mixing of cell types or quality issues
- **Long tails**: Common for count depth; MAD filtering handles this
- **Sharp cutoffs**: May indicate prior filtering or technical artifacts
### Violin Plots
- Shows distribution shape and density
- Median (line) and mean (diamond) should be similar for symmetric distributions
- Wide distributions suggest high heterogeneity
### Scatter Plots
**Counts vs Genes (colored by MT%)**
- Should show strong positive correlation (R² > 0.8 typical)
- Points deviating from trend may be outliers
- High MT% cells often cluster at low counts/genes
**Counts vs MT%**
- Negative correlation expected (dying cells have fewer counts)
- Vertical stratification may indicate batch effects
- Cells with high counts + high MT% are suspicious
**Genes vs MT%**
- Similar to counts vs MT%, but often weaker correlation
- Useful for identifying cells with gene detection issues
## Gene Filtering
After filtering cells, remove genes detected in fewer than 20 cells:
- **Why 20?**: Balances noise reduction with information retention
- **Benefits**: Reduces dataset size, speeds up computation, removes noisy genes
- **Trade-offs**: May lose very rare markers; adjust to 10 if studying rare populations
## Species-Specific Considerations
### Mouse (Mus musculus)
- Mitochondrial genes: mt-* (lowercase)
- Ribosomal genes: Rpl*, Rps* (capitalized first letter)
- Hemoglobin genes: Hb* (but not Hbp1)
### Human (Homo sapiens)
- Mitochondrial genes: MT-* (uppercase)
- Ribosomal genes: RPL*, RPS* (all uppercase)
- Hemoglobin genes: HB* (but not HBP1)
### Other Species
Adjust gene name patterns in the script to match your organism's gene nomenclature. Consult Ensembl or your reference annotation for correct prefixes.
## When to Adjust Parameters
Consider adjusting filtering thresholds when:
**More stringent (lower MADs)**
- High ambient RNA contamination suspected
- Many low-quality cells observed in visualizations
- Downstream analysis shows quality-driven clustering
**More permissive (higher MADs)**
- Studying rare cell populations
- Dataset has high technical quality
- Cell types naturally have extreme values (e.g., neurons with high MT%)
**Tissue-specific adjustments**
- Brain/neurons: May need higher MT% threshold (10-15%)
- Blood: Can be more stringent with MT% (5-8%)
- Tumor samples: Often need more permissive thresholds due to biological variation
## Advanced QC Considerations
### Not Included in This Workflow
**Ambient RNA correction**
- Tool: SoupX, CellBender, DecontX
- When: High background RNA in droplet-based data
- Effect: Removes contamination from lysed cells
**Doublet detection**
- Tool: scDblFinder, scrublet, DoubletFinder
- When: Always recommended for droplet-based data
- Effect: Identifies and removes multiplets (2+ cells in one droplet)
**Cell cycle scoring**
- Tool: scanpy's score_genes_cell_cycle
- When: Cell cycle effects confound biological signal
- Effect: Allows regressing out or accounting for cell cycle phase
**Batch correction**
- Tool: Harmony, scVI, ComBat
- When: Integrating data from multiple batches/experiments
- Effect: Removes technical batch effects while preserving biology
## References
- scverse Best Practices: https://www.sc-best-practices.org/preprocessing_visualization/quality_control.html
- Luecken & Theis (2019): Current best practices in single-cell RNA-seq analysis
- Osorio & Cai (2021): Systematic determination of the mitochondrial proportion in human and mouse genomes
- Germain et al. (2020): Doublet identification in single-cell sequencing data using scDblFinder
```