SkillHub ClubShip Full StackFull StackIntegration

bio-fasta

Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars

781

Hot score

Updated

March 20, 2026

Overall rating

C4.6

Composite score

4.6

Best-practice grade

C62.8

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-bio-seq

Repository

benchflow-ai/SkillsBench

Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_protein-assembly/environment/skills/bio-seq

Open repository

Best for

Primary workflow: Ship Full Stack.

Technical facets: Full Stack, Integration.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is still a mirrored public skill entry. Review the repository before installing into production workflows.

What it helps with

Install bio-fasta into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
Review https://github.com/benchflow-ai/SkillsBench before adding bio-fasta to shared team environments
Use bio-fasta for development workflows

Works across

Claude CodeCodex CLIGemini CLIOpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: bio-fasta
description: "Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration."
user_invocable: true
---

# Sequence I/O

Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).

## When to Use This Skill

This skill should be used when:

- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)

## When NOT to Use This Skill

- **NGS alignment files (SAM/BAM/VCF)** → Use `pysam`
- **BLAST searches** → Use `gget` (quick) or `blat-integration` (large-scale)
- **Multiple sequence alignment** → Use `msa-advanced`
- **Phylogenetic analysis** → Use `etetoolkit`
- **NCBI database queries** → Use `pubmed-database` or `gene-database`

## Tool Selection Guide

| Task | Tool | Reference |
|------|------|-----------|
| Parse FASTA/GenBank/FASTQ | `Bio.SeqIO` | `biopython_seqio.md` |
| Convert file formats | `Bio.SeqIO.convert()` | `biopython_seqio.md` |
| Sequence operations | `Bio.Seq` | `biopython_seqio.md` |
| Large FASTA random access | `pysam.FastaFile` + faidx | `faidx.md` |
| GC%, Tm, molecular weight | `Bio.SeqUtils` | `utilities.md` |

## Quick Start

### Installation

```bash
uv pip install biopython pysam
```

### Read FASTA

```python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(f"{record.id}: {len(record.seq)} bp")
```

### Convert GenBank to FASTA

```python
from Bio import SeqIO

SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```

### Random Access with faidx

```python
import pysam

# Create index (once)
pysam.faidx("reference.fasta")

# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000)  # 0-based coordinates
fasta.close()
```

### Sequence Operations

```python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")
print(seq.complement())
print(seq.reverse_complement())
print(seq.translate())
```

## Reference Documentation

Consult the appropriate reference file for detailed documentation:

### `references/biopython_seqio.md`

- `Bio.Seq` object and sequence operations
- `Bio.SeqIO` for file parsing and writing
- `SeqRecord` object and annotations
- Supported file formats
- Format conversion patterns

### `references/faidx.md`

- Creating FASTA index with `pysam.faidx()`
- `pysam.FastaFile` for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)

### `references/utilities.md`

- GC content calculation (`gc_fraction`)
- Molecular weight (`molecular_weight`)
- Melting temperature (`MeltingTemp`)
- Codon usage analysis
- Restriction enzyme sites

### `references/formats.md`

- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation

## Coordinate Systems

**Biopython**: Uses Python-style 0-based, half-open intervals for slicing.

**pysam.FastaFile.fetch()**:
- Numeric arguments: 0-based (`fetch("chr1", 999, 2000)` = positions 999-1999)
- Region strings: 1-based (`fetch("chr1:1000-2000")` = positions 1000-2000)

## Common Pitfalls

1. **Coordinate confusion**: Remember which tool uses 0-based vs 1-based
2. **Missing faidx index**: Random access requires `.fai` file
3. **Format mismatch**: Verify file format matches the format string in `SeqIO.parse()`
4. **Iterator exhaustion**: `SeqIO.parse()` returns an iterator; convert to list if multiple passes needed
5. **Large files**: Use iterators, not `list()`, for memory efficiency


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/biopython_seqio.md

```markdown
# Biopython Sequence Handling (Bio.Seq & Bio.SeqIO)

## Bio.Seq: The Seq Object

### Creating Sequences

```python
from Bio.Seq import Seq

# Create a basic sequence
my_seq = Seq("AGTACACTGGT")

# Sequences support string-like operations
print(len(my_seq))      # Length
print(my_seq[0:5])      # Slicing
print(my_seq.upper())   # Case conversion
```

### Core Sequence Operations

```python
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")

# Complement and reverse complement
complement = seq.complement()
rev_comp = seq.reverse_complement()

# Transcription (DNA to RNA)
rna = seq.transcribe()

# Translation (to protein)
protein = seq.translate()

# Back-transcription (RNA to DNA)
dna = rna.back_transcribe()
```

### Sequence Methods Reference

| Method | Description |
|--------|-------------|
| `complement()` | Returns complementary strand |
| `reverse_complement()` | Returns reverse complement |
| `transcribe()` | DNA to RNA transcription |
| `back_transcribe()` | RNA to DNA conversion |
| `translate()` | Translate to protein sequence |
| `translate(table=N)` | Use specific genetic code table |
| `translate(to_stop=True)` | Stop at first stop codon |

### Translation Options

```python
# Standard translation
protein = seq.translate()

# Using alternative genetic code (e.g., mitochondrial)
protein = seq.translate(table=2)

# Stop at first stop codon
protein = seq.translate(to_stop=True)

# Include stop codon as '*'
protein = seq.translate(stop_symbol="*")
```

## Bio.SeqIO: Sequence File I/O

### Core Functions

#### SeqIO.parse() - Read Multiple Records

Primary function for reading sequence files as an iterator of `SeqRecord` objects.

```python
from Bio import SeqIO

# Parse a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id)
    print(record.seq)
    print(len(record))
```

#### SeqIO.read() - Read Single Record

For files containing exactly one record.

```python
record = SeqIO.read("single.fasta", "fasta")
```

#### SeqIO.write() - Write Records

Output SeqRecord objects to files.

```python
from Bio import SeqIO

# Write records to file
records = [record1, record2, record3]
count = SeqIO.write(records, "output.fasta", "fasta")
print(f"Wrote {count} records")
```

#### SeqIO.convert() - Format Conversion

Streamlined format conversion.

```python
# Convert between formats
count = SeqIO.convert("input.gbk", "genbank", "output.fasta", "fasta")
```

#### SeqIO.index() - Random Access by ID

Create dictionary-like access to records by identifier.

```python
# Index a file for random access
record_dict = SeqIO.index("sequences.fasta", "fasta")

# Access by ID
record = record_dict["seq_id_123"]

# Close when done
record_dict.close()
```

#### SeqIO.to_dict() - Load All to Memory

Load all records into a dictionary (for small files only).

```python
# Load all records into memory
records = SeqIO.to_dict(SeqIO.parse("small.fasta", "fasta"))
record = records["seq_id"]
```

### Supported File Formats

| Format | Description |
|--------|-------------|
| `fasta` | FASTA format |
| `fastq` | FASTQ format (with quality scores) |
| `genbank` or `gb` | GenBank format |
| `embl` | EMBL format |
| `swiss` | SwissProt format |
| `fasta-2line` | FASTA with sequence on one line |
| `tab` | Simple tab-separated format |
| `qual` | Quality scores only |
| `pir` | PIR/NBRF format |
| `clustal` | Clustal alignment format |
| `phylip` | PHYLIP format |
| `stockholm` | Stockholm alignment format |

### The SeqRecord Object

`SeqRecord` combines sequence data with annotations.

```python
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Create a SeqRecord
record = SeqRecord(
    Seq("ATGCGATCGATCG"),
    id="my_sequence",
    name="MySeq",
    description="Example sequence"
)

# Access attributes
print(record.id)          # Identifier
print(record.seq)         # Sequence (Seq object)
print(record.name)        # Name
print(record.description) # Description
print(record.annotations) # Dictionary of annotations
print(record.features)    # List of SeqFeature objects
print(record.letter_annotations)  # Per-letter annotations
```

### Common Patterns

#### Filter Sequences by Length

```python
from Bio import SeqIO

def filter_by_length(input_file, output_file, min_length=100):
    """Keep only sequences longer than min_length."""
    records = (r for r in SeqIO.parse(input_file, "fasta")
               if len(r.seq) >= min_length)
    count = SeqIO.write(records, output_file, "fasta")
    return count
```

#### Extract Subsequences

```python
from Bio import SeqIO

for record in SeqIO.parse("sequences.fasta", "fasta"):
    # Get first 100 bases
    subseq = record.seq[:100]

    # Create new record with subsequence
    new_record = record[:100]  # Preserves annotations
```

#### Batch Processing

```python
from Bio import SeqIO

def process_in_batches(input_file, batch_size=1000):
    """Process sequences in batches."""
    batch = []
    for record in SeqIO.parse(input_file, "fasta"):
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

#### Modify and Write

```python
from Bio import SeqIO

def add_prefix_to_ids(input_file, output_file, prefix):
    """Add prefix to all sequence IDs."""
    records = []
    for record in SeqIO.parse(input_file, "fasta"):
        record.id = f"{prefix}_{record.id}"
        record.description = ""  # Clear to avoid duplication
        records.append(record)
    SeqIO.write(records, output_file, "fasta")
```

## Working with GenBank Files

GenBank files contain rich annotations.

```python
from Bio import SeqIO

record = SeqIO.read("sequence.gb", "genbank")

# Access annotations
print(record.annotations["organism"])
print(record.annotations["taxonomy"])

# Iterate through features
for feature in record.features:
    if feature.type == "CDS":
        print(f"CDS: {feature.location}")
        print(f"  Product: {feature.qualifiers.get('product', ['N/A'])[0]}")
        print(f"  Translation: {feature.qualifiers.get('translation', ['N/A'])[0][:50]}...")
```

### Extract CDS Sequences

```python
from Bio import SeqIO

record = SeqIO.read("genome.gb", "genbank")

for feature in record.features:
    if feature.type == "CDS":
        # Extract DNA sequence for this CDS
        cds_seq = feature.extract(record.seq)

        # Get protein sequence from annotation
        if "translation" in feature.qualifiers:
            protein = feature.qualifiers["translation"][0]
```

## Working with FASTQ Files

FASTQ files include quality scores.

```python
from Bio import SeqIO

for record in SeqIO.parse("reads.fastq", "fastq"):
    print(f"ID: {record.id}")
    print(f"Seq: {record.seq}")
    print(f"Qual: {record.letter_annotations['phred_quality']}")
```

### Quality Filtering

```python
from Bio import SeqIO

def filter_by_quality(input_file, output_file, min_quality=20):
    """Keep reads with mean quality >= min_quality."""
    def mean_quality(record):
        return sum(record.letter_annotations["phred_quality"]) / len(record)

    good_reads = (r for r in SeqIO.parse(input_file, "fastq")
                  if mean_quality(r) >= min_quality)
    count = SeqIO.write(good_reads, output_file, "fastq")
    return count
```

## Best Practices

1. **Use iterators for large files** - Avoid `list(SeqIO.parse(...))` for large files
2. **Use SeqIO.index() for random access** - More memory efficient than to_dict()
3. **Close file handles** - Use context managers or explicit close()
4. **Validate format** - Check file format matches the format string
5. **Handle missing data** - Not all records have all fields

```

### references/faidx.md

```markdown
# Indexed FASTA Access with faidx

## Overview

For large FASTA files, random access by genomic coordinates requires an index file (`.fai`). The `pysam` library provides `FastaFile` class for efficient indexed access.

## Creating a FASTA Index

```python
import pysam

# Create index using pysam
pysam.faidx("reference.fasta")

# Or using samtools command
pysam.samtools.faidx("reference.fasta")
```

This creates a `.fai` index file required for random access.

### Index File Format

The `.fai` file is a tab-delimited text file with columns:

| Column | Description |
|--------|-------------|
| NAME | Sequence name |
| LENGTH | Sequence length in bases |
| OFFSET | Byte offset of first base |
| LINEBASES | Bases per line |
| LINEWIDTH | Bytes per line (including newline) |

## Opening FASTA Files

```python
import pysam

# Open indexed FASTA file
fasta = pysam.FastaFile("reference.fasta")

# Automatically looks for reference.fasta.fai index
```

## FastaFile Properties

```python
fasta = pysam.FastaFile("reference.fasta")

# List of reference sequences
references = fasta.references
print(f"References: {references}")

# Get lengths
lengths = fasta.lengths
print(f"Lengths: {lengths}")

# Get specific sequence length
chr1_length = fasta.get_reference_length("chr1")

# Number of references
n_refs = fasta.nreferences

# Close when done
fasta.close()
```

## Fetching Sequences

### Coordinate Systems

**Critical:** pysam uses **0-based, half-open** coordinates (Python convention):
- Start positions are 0-based (first base is position 0)
- End positions are exclusive (not included in the range)
- Region 1000-2000 includes bases 1000-1999 (1000 bases total)

### Fetch by Numeric Coordinates (0-based)

```python
# Fetch specific region (0-based)
sequence = fasta.fetch("chr1", 1000, 2000)
print(f"Sequence: {sequence}")  # Returns 1000 bases (positions 1000-1999)

# Fetch entire chromosome
chr1_seq = fasta.fetch("chr1")
```

### Fetch by Region String (1-based)

```python
# Fetch using region string (1-based, samtools convention)
sequence = fasta.fetch(region="chr1:1001-2000")

# Equivalent to:
sequence = fasta.fetch("chr1", 1000, 2000)  # 0-based
```

## Common Patterns

### Get Sequence Context Around a Variant

```python
def get_variant_context(fasta, chrom, pos, window=10):
    """Get sequence context around a variant position (1-based input)."""
    start = max(0, pos - window - 1)  # Convert to 0-based
    end = pos + window
    return fasta.fetch(chrom, start, end)

# Usage
context = get_variant_context(fasta, "chr1", 12345, window=20)
```

### Get Gene Sequence with Strand Awareness

```python
def get_gene_sequence(fasta, chrom, start, end, strand):
    """Get gene sequence with strand awareness (0-based input)."""
    seq = fasta.fetch(chrom, start, end)

    if strand == "-":
        # Reverse complement
        complement = str.maketrans("ATGCatgc", "TACGtacg")
        seq = seq.translate(complement)[::-1]

    return seq

# Usage
gene_seq = get_gene_sequence(fasta, "chr1", 1000, 2000, "-")
```

### Verify Reference Allele at Position

```python
def check_ref_allele(fasta, chrom, pos, expected_ref):
    """Verify reference allele at position (1-based pos)."""
    actual = fasta.fetch(chrom, pos - 1, pos)  # Convert to 0-based
    return actual.upper() == expected_ref.upper()

# Usage
is_correct = check_ref_allele(fasta, "chr1", 12345, "A")
```

### Extract Multiple Regions

```python
# Extract multiple regions efficiently
regions = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 10000, 11000)
]

sequences = {}
for chrom, start, end in regions:
    seq_id = f"{chrom}:{start}-{end}"
    sequences[seq_id] = fasta.fetch(chrom, start, end)
```

### Check N Content

```python
def has_quality_sequence(fasta, chrom, start, end, max_n_frac=0.1):
    """Check if region has acceptable N content."""
    seq = fasta.fetch(chrom, start, end)
    n_count = seq.upper().count('N')
    return (n_count / len(seq)) <= max_n_frac

# Usage
is_good = has_quality_sequence(fasta, "chr1", 1000, 2000, max_n_frac=0.05)
```

## Working with Ambiguous Bases

FASTA files may contain IUPAC ambiguity codes:

| Code | Meaning |
|------|---------|
| N | Any base |
| R | A or G (purine) |
| Y | C or T (pyrimidine) |
| S | G or C (strong) |
| W | A or T (weak) |
| K | G or T (keto) |
| M | A or C (amino) |
| B | C, G, or T (not A) |
| D | A, G, or T (not C) |
| H | A, C, or T (not G) |
| V | A, C, or G (not T) |

```python
def count_ambiguous(sequence):
    """Count non-ATGC bases."""
    return sum(1 for base in sequence.upper() if base not in "ATGC")
```

## Context Manager Usage

```python
import pysam

# Using context manager (recommended)
with pysam.FastaFile("reference.fasta") as fasta:
    seq = fasta.fetch("chr1", 1000, 2000)
    # File automatically closed when exiting block
```

## Performance Tips

1. **Always use indexed FASTA** - Create `.fai` with `pysam.faidx()`
2. **Batch fetch operations** - Minimize open/close cycles
3. **Cache frequently accessed sequences** - Store in memory if reused
4. **Use appropriate window sizes** - Avoid loading excessive data
5. **Close files explicitly** - Free resources when done

## Common Pitfalls

1. **Coordinate confusion** - Remember 0-based numeric vs 1-based string
2. **Missing index** - Random access requires `.fai` file
3. **Case sensitivity** - FASTA preserves case; use `.upper()` for consistent comparison
4. **Chromosome naming** - Ensure names match exactly (e.g., "chr1" vs "1")
5. **Out of bounds** - Fetching beyond sequence length returns truncated result

## Comparison with Bio.SeqIO

| Feature | pysam.FastaFile | Bio.SeqIO |
|---------|-----------------|-----------|
| Random access | Yes (with index) | Yes (with SeqIO.index) |
| Memory usage | Low (indexed) | Varies |
| Coordinate input | Numeric (0-based) | N/A (by ID only) |
| Best for | Large genomes, region queries | Format conversion, annotations |

```

### references/utilities.md

```markdown
# Sequence Utilities (Bio.SeqUtils)

## Overview

Biopython's `Bio.SeqUtils` module provides utilities for calculating sequence statistics including GC content, molecular weight, melting temperature, and codon usage.

## GC Content

### Basic GC Calculation

```python
from Bio.SeqUtils import gc_fraction
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")
gc = gc_fraction(seq)
print(f"GC content: {gc:.2%}")  # Output: GC content: 53.85%
```

### GC Content Variants

```python
from Bio.SeqUtils import gc_fraction

# Standard GC fraction (G+C)/(A+T+G+C)
gc = gc_fraction(seq)

# For ambiguous bases, use:
from Bio.SeqUtils import GC  # Deprecated but handles ambiguous
```

### GC Skew

```python
def gc_skew(seq):
    """Calculate GC skew: (G-C)/(G+C)"""
    g = seq.upper().count('G')
    c = seq.upper().count('C')
    if g + c == 0:
        return 0
    return (g - c) / (g + c)
```

## Molecular Weight

### DNA Molecular Weight

```python
from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq

dna_seq = Seq("ATGCGATCGATCG")

# Single-stranded DNA
mw_ss = molecular_weight(dna_seq, seq_type="DNA")
print(f"MW (ssDNA): {mw_ss:.2f} Da")

# Double-stranded DNA
mw_ds = molecular_weight(dna_seq, seq_type="DNA", double_stranded=True)
print(f"MW (dsDNA): {mw_ds:.2f} Da")
```

### RNA Molecular Weight

```python
from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq

rna_seq = Seq("AUGCGAUCGAUCG")
mw = molecular_weight(rna_seq, seq_type="RNA")
print(f"MW (RNA): {mw:.2f} Da")
```

### Protein Molecular Weight

```python
from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq

protein_seq = Seq("MRHILK")
mw = molecular_weight(protein_seq, seq_type="protein")
print(f"MW (protein): {mw:.2f} Da")
```

## Melting Temperature (Tm)

### Basic Tm Calculation

```python
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")

# Wallace rule (simple, for short oligos)
tm_wallace = mt.Tm_Wallace(seq)

# GC method
tm_gc = mt.Tm_GC(seq)

# Nearest-neighbor method (most accurate)
tm_nn = mt.Tm_NN(seq)

print(f"Tm (Wallace): {tm_wallace:.1f}°C")
print(f"Tm (GC): {tm_gc:.1f}°C")
print(f"Tm (NN): {tm_nn:.1f}°C")
```

### Nearest-Neighbor with Custom Parameters

```python
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Seq import Seq

seq = Seq("ATGCGATCGATCG")

# With salt concentration
tm = mt.Tm_NN(seq, Na=50, K=0, Tris=0, Mg=0, dNTPs=0)

# With primer concentration
tm = mt.Tm_NN(seq, dnac1=250, dnac2=0)  # nM concentrations

print(f"Tm: {tm:.1f}°C")
```

### Tm Methods Comparison

| Method | Use Case | Accuracy |
|--------|----------|----------|
| `Tm_Wallace` | Quick estimate, short oligos (<14 bp) | Low |
| `Tm_GC` | General purpose, medium oligos | Medium |
| `Tm_NN` | PCR primers, accurate calculation | High |

## Codon Usage

### Count Codons

```python
from Bio.SeqUtils import CodonUsage
from Bio.Seq import Seq

# Create codon usage index
cds = Seq("ATGAAACCCGGGTTTTAA")
codon_count = {}
for i in range(0, len(cds) - 2, 3):
    codon = str(cds[i:i+3])
    codon_count[codon] = codon_count.get(codon, 0) + 1

print(codon_count)
```

### Codon Adaptation Index (CAI)

```python
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
from Bio.Seq import Seq

# Calculate CAI (requires reference codon usage)
cai = CodonAdaptationIndex()

# Generate index from reference sequences
# cai.generate_index("reference_genes.fasta")

# Calculate CAI for a sequence
# score = cai.cai_for_gene(str(my_gene_seq))
```

## Protein Analysis

### Isoelectric Point

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

protein_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNG"
analysis = ProteinAnalysis(protein_seq)

pi = analysis.isoelectric_point()
print(f"pI: {pi:.2f}")
```

### Amino Acid Composition

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

protein_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNG"
analysis = ProteinAnalysis(protein_seq)

# Amino acid percentage
aa_percent = analysis.get_amino_acids_percent()
for aa, pct in sorted(aa_percent.items()):
    if pct > 0:
        print(f"{aa}: {pct:.1%}")
```

### Protein Properties

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

protein_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNG"
analysis = ProteinAnalysis(protein_seq)

# Molecular weight
mw = analysis.molecular_weight()
print(f"MW: {mw:.2f} Da")

# Aromaticity
aromaticity = analysis.aromaticity()
print(f"Aromaticity: {aromaticity:.3f}")

# Instability index
instability = analysis.instability_index()
print(f"Instability index: {instability:.2f}")

# GRAVY (Grand average of hydropathicity)
gravy = analysis.gravy()
print(f"GRAVY: {gravy:.3f}")

# Secondary structure fraction
helix, turn, sheet = analysis.secondary_structure_fraction()
print(f"Helix: {helix:.1%}, Turn: {turn:.1%}, Sheet: {sheet:.1%}")
```

## Restriction Analysis

### Find Restriction Sites

```python
from Bio.Restriction import RestrictionBatch, EcoRI, BamHI, HindIII
from Bio.Seq import Seq

seq = Seq("ATGAATTCGATCGGATCCAAGCTTGCATGC")

# Single enzyme
sites = EcoRI.search(seq)
print(f"EcoRI sites: {sites}")

# Multiple enzymes
rb = RestrictionBatch([EcoRI, BamHI, HindIII])
result = rb.search(seq)
for enzyme, positions in result.items():
    if positions:
        print(f"{enzyme}: {positions}")
```

### Common Restriction Enzymes

| Enzyme | Recognition Sequence |
|--------|---------------------|
| EcoRI | G^AATTC |
| BamHI | G^GATCC |
| HindIII | A^AGCTT |
| NotI | GC^GGCCGC |
| XhoI | C^TCGAG |

## Sequence Statistics Summary

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils import MeltingTemp as mt

def sequence_stats(seq_str):
    """Calculate common sequence statistics."""
    seq = Seq(seq_str)

    return {
        "length": len(seq),
        "gc_content": gc_fraction(seq),
        "molecular_weight": molecular_weight(seq, seq_type="DNA"),
        "tm_nn": mt.Tm_NN(seq),
        "a_count": seq.count("A"),
        "t_count": seq.count("T"),
        "g_count": seq.count("G"),
        "c_count": seq.count("C"),
    }

# Usage
stats = sequence_stats("ATGCGATCGATCG")
for key, value in stats.items():
    print(f"{key}: {value}")
```

```

### references/formats.md

```markdown
# Sequence File Formats

## FASTA Format

### Structure

```
>sequence_id description
SEQUENCE_LINE_1
SEQUENCE_LINE_2
...
```

### Example

```
>NM_001301717.2 Homo sapiens tumor protein p53 (TP53)
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCA
GACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATG
GATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCA
```

### Key Points

- Header line starts with `>`
- Sequence ID is the first word after `>`
- Description follows the ID (space-separated)
- Sequence lines typically 60-80 characters
- No spaces within sequence
- Multiple sequences concatenated

### Parsing with Biopython

```python
from Bio import SeqIO

for record in SeqIO.parse("file.fasta", "fasta"):
    print(f"ID: {record.id}")
    print(f"Description: {record.description}")
    print(f"Sequence: {record.seq[:50]}...")
```

## GenBank Format

### Structure

```
LOCUS       identifier    length bp    type    division   date
DEFINITION  description
ACCESSION   accession_number
VERSION     version
KEYWORDS    keywords
SOURCE      organism
  ORGANISM  full taxonomy
FEATURES    Location/Qualifiers
     source  1..length
             /organism="..."
             /mol_type="..."
     gene    start..end
             /gene="gene_name"
     CDS     start..end
             /gene="gene_name"
             /product="protein_name"
             /translation="PROTEIN_SEQUENCE"
ORIGIN
        1 atgcgatcga tcgatcgatc gatcgatcga tcgatcgatc gatcgatcga
       61 tcgatcgatc gatcgatcga tcgatcgatc
//
```

### Key Sections

| Section | Description |
|---------|-------------|
| LOCUS | Identifier, length, type, date |
| DEFINITION | Sequence description |
| ACCESSION | Database accession number |
| VERSION | Accession with version |
| FEATURES | Annotations (genes, CDS, etc.) |
| ORIGIN | The actual sequence |

### Feature Types

| Type | Description |
|------|-------------|
| source | Full sequence source info |
| gene | Gene region |
| CDS | Coding sequence |
| mRNA | Messenger RNA |
| exon | Exon region |
| intron | Intron region |
| misc_feature | Other features |

### Parsing with Biopython

```python
from Bio import SeqIO

record = SeqIO.read("file.gb", "genbank")

# Access annotations
print(record.annotations["organism"])
print(record.annotations["taxonomy"])

# Access features
for feature in record.features:
    if feature.type == "CDS":
        print(f"CDS: {feature.location}")
        if "product" in feature.qualifiers:
            print(f"  Product: {feature.qualifiers['product'][0]}")
```

## FASTQ Format

### Structure

```
@sequence_id description
SEQUENCE
+
QUALITY_SCORES
```

### Example

```
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
```

### Quality Score Encoding

**Phred+33 (Sanger/Illumina 1.8+):**
- ASCII 33 (`!`) = Q0
- ASCII 43 (`+`) = Q10
- ASCII 53 (`5`) = Q20
- ASCII 63 (`?`) = Q30
- ASCII 73 (`I`) = Q40

**Quality Score Meaning:**
- Q10 = 10% error rate (1 in 10)
- Q20 = 1% error rate (1 in 100)
- Q30 = 0.1% error rate (1 in 1000)
- Q40 = 0.01% error rate (1 in 10000)

### Parsing with Biopython

```python
from Bio import SeqIO

for record in SeqIO.parse("file.fastq", "fastq"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")
    print(f"Quality: {record.letter_annotations['phred_quality']}")
```

### Parsing with pysam

```python
import pysam

for read in pysam.FastxFile("file.fastq"):
    print(f"Name: {read.name}")
    print(f"Sequence: {read.sequence}")
    print(f"Quality: {read.quality}")
    print(f"Quality array: {read.get_quality_array()}")
```

## Format Detection

### By Extension

| Extension | Format |
|-----------|--------|
| `.fasta`, `.fa`, `.fna`, `.faa` | FASTA |
| `.fastq`, `.fq` | FASTQ |
| `.gb`, `.gbk`, `.genbank` | GenBank |
| `.embl` | EMBL |

### By Content

```python
def detect_format(filepath):
    """Detect sequence file format from content."""
    with open(filepath, 'r') as f:
        first_char = f.read(1)

    if first_char == '>':
        return 'fasta'
    elif first_char == '@':
        return 'fastq'
    elif first_char == 'L':  # LOCUS
        return 'genbank'
    else:
        return 'unknown'
```

## Format Conversion

### Common Conversions

```python
from Bio import SeqIO

# GenBank to FASTA
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")

# FASTQ to FASTA (loses quality)
SeqIO.convert("input.fastq", "fastq", "output.fasta", "fasta")

# FASTA to tab-delimited
SeqIO.convert("input.fasta", "fasta", "output.tab", "tab")
```

### Preserve Annotations

```python
from Bio import SeqIO

# Read with full annotations
records = list(SeqIO.parse("input.gb", "genbank"))

# Write to EMBL (preserves more than FASTA)
SeqIO.write(records, "output.embl", "embl")
```

## Compressed Files

### Reading Compressed Files

```python
import gzip
from Bio import SeqIO

# GZIP compressed
with gzip.open("file.fasta.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        print(record.id)
```

### With pysam

```python
import pysam

# pysam handles compression automatically
for read in pysam.FastxFile("file.fastq.gz"):
    print(read.name)
```

## Validation

### Check FASTA Validity

```python
def validate_fasta(filepath):
    """Basic FASTA validation."""
    from Bio import SeqIO

    try:
        records = list(SeqIO.parse(filepath, "fasta"))
        if not records:
            return False, "No records found"

        for record in records:
            if not record.id:
                return False, "Record without ID"
            if not record.seq:
                return False, f"Empty sequence: {record.id}"

        return True, f"Valid FASTA with {len(records)} records"

    except Exception as e:
        return False, str(e)
```

### Check FASTQ Validity

```python
def validate_fastq(filepath):
    """Basic FASTQ validation."""
    from Bio import SeqIO

    try:
        count = 0
        for record in SeqIO.parse(filepath, "fastq"):
            count += 1
            seq_len = len(record.seq)
            qual_len = len(record.letter_annotations["phred_quality"])

            if seq_len != qual_len:
                return False, f"Length mismatch in {record.id}"

        return True, f"Valid FASTQ with {count} records"

    except Exception as e:
        return False, str(e)
```

```