bio-fasta
Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install benchflow-ai-skillsbench-bio-seq
Repository
Skill path: registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_protein-assembly/environment/skills/bio-seq
Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration.
Open repositoryBest for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack, Integration.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: benchflow-ai.
This is still a mirrored public skill entry. Review the repository before installing into production workflows.
What it helps with
- Install bio-fasta into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/benchflow-ai/SkillsBench before adding bio-fasta to shared team environments
- Use bio-fasta for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: bio-fasta
description: "Read/write FASTA, GenBank, FASTQ files. Sequence manipulation (complement, translate). Indexed random access via faidx. For NGS pipelines (SAM/BAM/VCF), use pysam. For BLAST, use gget or blat-integration."
user_invocable: true
---
# Sequence I/O
Read, write, and manipulate biological sequence files (FASTA, GenBank, FASTQ).
## When to Use This Skill
This skill should be used when:
- Reading or writing sequence files (FASTA, GenBank, FASTQ)
- Converting between sequence file formats
- Manipulating sequences (complement, reverse complement, translate)
- Extracting sequences from large indexed FASTA files (faidx)
- Calculating sequence statistics (GC content, molecular weight, Tm)
## When NOT to Use This Skill
- **NGS alignment files (SAM/BAM/VCF)** → Use `pysam`
- **BLAST searches** → Use `gget` (quick) or `blat-integration` (large-scale)
- **Multiple sequence alignment** → Use `msa-advanced`
- **Phylogenetic analysis** → Use `etetoolkit`
- **NCBI database queries** → Use `pubmed-database` or `gene-database`
## Tool Selection Guide
| Task | Tool | Reference |
|------|------|-----------|
| Parse FASTA/GenBank/FASTQ | `Bio.SeqIO` | `biopython_seqio.md` |
| Convert file formats | `Bio.SeqIO.convert()` | `biopython_seqio.md` |
| Sequence operations | `Bio.Seq` | `biopython_seqio.md` |
| Large FASTA random access | `pysam.FastaFile` + faidx | `faidx.md` |
| GC%, Tm, molecular weight | `Bio.SeqUtils` | `utilities.md` |
## Quick Start
### Installation
```bash
uv pip install biopython pysam
```
### Read FASTA
```python
from Bio import SeqIO
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(f"{record.id}: {len(record.seq)} bp")
```
### Convert GenBank to FASTA
```python
from Bio import SeqIO
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
```
### Random Access with faidx
```python
import pysam
# Create index (once)
pysam.faidx("reference.fasta")
# Random access
fasta = pysam.FastaFile("reference.fasta")
seq = fasta.fetch("chr1", 1000, 2000) # 0-based coordinates
fasta.close()
```
### Sequence Operations
```python
from Bio.Seq import Seq
seq = Seq("ATGCGATCGATCG")
print(seq.complement())
print(seq.reverse_complement())
print(seq.translate())
```
## Reference Documentation
Consult the appropriate reference file for detailed documentation:
### `references/biopython_seqio.md`
- `Bio.Seq` object and sequence operations
- `Bio.SeqIO` for file parsing and writing
- `SeqRecord` object and annotations
- Supported file formats
- Format conversion patterns
### `references/faidx.md`
- Creating FASTA index with `pysam.faidx()`
- `pysam.FastaFile` for random access
- Coordinate systems (0-based vs 1-based)
- Performance considerations for large files
- Common patterns (variant context, gene extraction)
### `references/utilities.md`
- GC content calculation (`gc_fraction`)
- Molecular weight (`molecular_weight`)
- Melting temperature (`MeltingTemp`)
- Codon usage analysis
- Restriction enzyme sites
### `references/formats.md`
- FASTA format specification
- GenBank format specification
- FASTQ format and quality scores
- Format detection and validation
## Coordinate Systems
**Biopython**: Uses Python-style 0-based, half-open intervals for slicing.
**pysam.FastaFile.fetch()**:
- Numeric arguments: 0-based (`fetch("chr1", 999, 2000)` = positions 999-1999)
- Region strings: 1-based (`fetch("chr1:1000-2000")` = positions 1000-2000)
## Common Pitfalls
1. **Coordinate confusion**: Remember which tool uses 0-based vs 1-based
2. **Missing faidx index**: Random access requires `.fai` file
3. **Format mismatch**: Verify file format matches the format string in `SeqIO.parse()`
4. **Iterator exhaustion**: `SeqIO.parse()` returns an iterator; convert to list if multiple passes needed
5. **Large files**: Use iterators, not `list()`, for memory efficiency
---
## Referenced Files
> The following files are referenced in this skill and included for context.
### references/biopython_seqio.md
```markdown
# Biopython Sequence Handling (Bio.Seq & Bio.SeqIO)
## Bio.Seq: The Seq Object
### Creating Sequences
```python
from Bio.Seq import Seq
# Create a basic sequence
my_seq = Seq("AGTACACTGGT")
# Sequences support string-like operations
print(len(my_seq)) # Length
print(my_seq[0:5]) # Slicing
print(my_seq.upper()) # Case conversion
```
### Core Sequence Operations
```python
from Bio.Seq import Seq
seq = Seq("ATGCGATCGATCG")
# Complement and reverse complement
complement = seq.complement()
rev_comp = seq.reverse_complement()
# Transcription (DNA to RNA)
rna = seq.transcribe()
# Translation (to protein)
protein = seq.translate()
# Back-transcription (RNA to DNA)
dna = rna.back_transcribe()
```
### Sequence Methods Reference
| Method | Description |
|--------|-------------|
| `complement()` | Returns complementary strand |
| `reverse_complement()` | Returns reverse complement |
| `transcribe()` | DNA to RNA transcription |
| `back_transcribe()` | RNA to DNA conversion |
| `translate()` | Translate to protein sequence |
| `translate(table=N)` | Use specific genetic code table |
| `translate(to_stop=True)` | Stop at first stop codon |
### Translation Options
```python
# Standard translation
protein = seq.translate()
# Using alternative genetic code (e.g., mitochondrial)
protein = seq.translate(table=2)
# Stop at first stop codon
protein = seq.translate(to_stop=True)
# Include stop codon as '*'
protein = seq.translate(stop_symbol="*")
```
## Bio.SeqIO: Sequence File I/O
### Core Functions
#### SeqIO.parse() - Read Multiple Records
Primary function for reading sequence files as an iterator of `SeqRecord` objects.
```python
from Bio import SeqIO
# Parse a FASTA file
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(record.id)
print(record.seq)
print(len(record))
```
#### SeqIO.read() - Read Single Record
For files containing exactly one record.
```python
record = SeqIO.read("single.fasta", "fasta")
```
#### SeqIO.write() - Write Records
Output SeqRecord objects to files.
```python
from Bio import SeqIO
# Write records to file
records = [record1, record2, record3]
count = SeqIO.write(records, "output.fasta", "fasta")
print(f"Wrote {count} records")
```
#### SeqIO.convert() - Format Conversion
Streamlined format conversion.
```python
# Convert between formats
count = SeqIO.convert("input.gbk", "genbank", "output.fasta", "fasta")
```
#### SeqIO.index() - Random Access by ID
Create dictionary-like access to records by identifier.
```python
# Index a file for random access
record_dict = SeqIO.index("sequences.fasta", "fasta")
# Access by ID
record = record_dict["seq_id_123"]
# Close when done
record_dict.close()
```
#### SeqIO.to_dict() - Load All to Memory
Load all records into a dictionary (for small files only).
```python
# Load all records into memory
records = SeqIO.to_dict(SeqIO.parse("small.fasta", "fasta"))
record = records["seq_id"]
```
### Supported File Formats
| Format | Description |
|--------|-------------|
| `fasta` | FASTA format |
| `fastq` | FASTQ format (with quality scores) |
| `genbank` or `gb` | GenBank format |
| `embl` | EMBL format |
| `swiss` | SwissProt format |
| `fasta-2line` | FASTA with sequence on one line |
| `tab` | Simple tab-separated format |
| `qual` | Quality scores only |
| `pir` | PIR/NBRF format |
| `clustal` | Clustal alignment format |
| `phylip` | PHYLIP format |
| `stockholm` | Stockholm alignment format |
### The SeqRecord Object
`SeqRecord` combines sequence data with annotations.
```python
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
# Create a SeqRecord
record = SeqRecord(
Seq("ATGCGATCGATCG"),
id="my_sequence",
name="MySeq",
description="Example sequence"
)
# Access attributes
print(record.id) # Identifier
print(record.seq) # Sequence (Seq object)
print(record.name) # Name
print(record.description) # Description
print(record.annotations) # Dictionary of annotations
print(record.features) # List of SeqFeature objects
print(record.letter_annotations) # Per-letter annotations
```
### Common Patterns
#### Filter Sequences by Length
```python
from Bio import SeqIO
def filter_by_length(input_file, output_file, min_length=100):
"""Keep only sequences longer than min_length."""
records = (r for r in SeqIO.parse(input_file, "fasta")
if len(r.seq) >= min_length)
count = SeqIO.write(records, output_file, "fasta")
return count
```
#### Extract Subsequences
```python
from Bio import SeqIO
for record in SeqIO.parse("sequences.fasta", "fasta"):
# Get first 100 bases
subseq = record.seq[:100]
# Create new record with subsequence
new_record = record[:100] # Preserves annotations
```
#### Batch Processing
```python
from Bio import SeqIO
def process_in_batches(input_file, batch_size=1000):
"""Process sequences in batches."""
batch = []
for record in SeqIO.parse(input_file, "fasta"):
batch.append(record)
if len(batch) >= batch_size:
yield batch
batch = []
if batch:
yield batch
```
#### Modify and Write
```python
from Bio import SeqIO
def add_prefix_to_ids(input_file, output_file, prefix):
"""Add prefix to all sequence IDs."""
records = []
for record in SeqIO.parse(input_file, "fasta"):
record.id = f"{prefix}_{record.id}"
record.description = "" # Clear to avoid duplication
records.append(record)
SeqIO.write(records, output_file, "fasta")
```
## Working with GenBank Files
GenBank files contain rich annotations.
```python
from Bio import SeqIO
record = SeqIO.read("sequence.gb", "genbank")
# Access annotations
print(record.annotations["organism"])
print(record.annotations["taxonomy"])
# Iterate through features
for feature in record.features:
if feature.type == "CDS":
print(f"CDS: {feature.location}")
print(f" Product: {feature.qualifiers.get('product', ['N/A'])[0]}")
print(f" Translation: {feature.qualifiers.get('translation', ['N/A'])[0][:50]}...")
```
### Extract CDS Sequences
```python
from Bio import SeqIO
record = SeqIO.read("genome.gb", "genbank")
for feature in record.features:
if feature.type == "CDS":
# Extract DNA sequence for this CDS
cds_seq = feature.extract(record.seq)
# Get protein sequence from annotation
if "translation" in feature.qualifiers:
protein = feature.qualifiers["translation"][0]
```
## Working with FASTQ Files
FASTQ files include quality scores.
```python
from Bio import SeqIO
for record in SeqIO.parse("reads.fastq", "fastq"):
print(f"ID: {record.id}")
print(f"Seq: {record.seq}")
print(f"Qual: {record.letter_annotations['phred_quality']}")
```
### Quality Filtering
```python
from Bio import SeqIO
def filter_by_quality(input_file, output_file, min_quality=20):
"""Keep reads with mean quality >= min_quality."""
def mean_quality(record):
return sum(record.letter_annotations["phred_quality"]) / len(record)
good_reads = (r for r in SeqIO.parse(input_file, "fastq")
if mean_quality(r) >= min_quality)
count = SeqIO.write(good_reads, output_file, "fastq")
return count
```
## Best Practices
1. **Use iterators for large files** - Avoid `list(SeqIO.parse(...))` for large files
2. **Use SeqIO.index() for random access** - More memory efficient than to_dict()
3. **Close file handles** - Use context managers or explicit close()
4. **Validate format** - Check file format matches the format string
5. **Handle missing data** - Not all records have all fields
```
### references/faidx.md
```markdown
# Indexed FASTA Access with faidx
## Overview
For large FASTA files, random access by genomic coordinates requires an index file (`.fai`). The `pysam` library provides `FastaFile` class for efficient indexed access.
## Creating a FASTA Index
```python
import pysam
# Create index using pysam
pysam.faidx("reference.fasta")
# Or using samtools command
pysam.samtools.faidx("reference.fasta")
```
This creates a `.fai` index file required for random access.
### Index File Format
The `.fai` file is a tab-delimited text file with columns:
| Column | Description |
|--------|-------------|
| NAME | Sequence name |
| LENGTH | Sequence length in bases |
| OFFSET | Byte offset of first base |
| LINEBASES | Bases per line |
| LINEWIDTH | Bytes per line (including newline) |
## Opening FASTA Files
```python
import pysam
# Open indexed FASTA file
fasta = pysam.FastaFile("reference.fasta")
# Automatically looks for reference.fasta.fai index
```
## FastaFile Properties
```python
fasta = pysam.FastaFile("reference.fasta")
# List of reference sequences
references = fasta.references
print(f"References: {references}")
# Get lengths
lengths = fasta.lengths
print(f"Lengths: {lengths}")
# Get specific sequence length
chr1_length = fasta.get_reference_length("chr1")
# Number of references
n_refs = fasta.nreferences
# Close when done
fasta.close()
```
## Fetching Sequences
### Coordinate Systems
**Critical:** pysam uses **0-based, half-open** coordinates (Python convention):
- Start positions are 0-based (first base is position 0)
- End positions are exclusive (not included in the range)
- Region 1000-2000 includes bases 1000-1999 (1000 bases total)
### Fetch by Numeric Coordinates (0-based)
```python
# Fetch specific region (0-based)
sequence = fasta.fetch("chr1", 1000, 2000)
print(f"Sequence: {sequence}") # Returns 1000 bases (positions 1000-1999)
# Fetch entire chromosome
chr1_seq = fasta.fetch("chr1")
```
### Fetch by Region String (1-based)
```python
# Fetch using region string (1-based, samtools convention)
sequence = fasta.fetch(region="chr1:1001-2000")
# Equivalent to:
sequence = fasta.fetch("chr1", 1000, 2000) # 0-based
```
## Common Patterns
### Get Sequence Context Around a Variant
```python
def get_variant_context(fasta, chrom, pos, window=10):
"""Get sequence context around a variant position (1-based input)."""
start = max(0, pos - window - 1) # Convert to 0-based
end = pos + window
return fasta.fetch(chrom, start, end)
# Usage
context = get_variant_context(fasta, "chr1", 12345, window=20)
```
### Get Gene Sequence with Strand Awareness
```python
def get_gene_sequence(fasta, chrom, start, end, strand):
"""Get gene sequence with strand awareness (0-based input)."""
seq = fasta.fetch(chrom, start, end)
if strand == "-":
# Reverse complement
complement = str.maketrans("ATGCatgc", "TACGtacg")
seq = seq.translate(complement)[::-1]
return seq
# Usage
gene_seq = get_gene_sequence(fasta, "chr1", 1000, 2000, "-")
```
### Verify Reference Allele at Position
```python
def check_ref_allele(fasta, chrom, pos, expected_ref):
"""Verify reference allele at position (1-based pos)."""
actual = fasta.fetch(chrom, pos - 1, pos) # Convert to 0-based
return actual.upper() == expected_ref.upper()
# Usage
is_correct = check_ref_allele(fasta, "chr1", 12345, "A")
```
### Extract Multiple Regions
```python
# Extract multiple regions efficiently
regions = [
("chr1", 1000, 2000),
("chr1", 5000, 6000),
("chr2", 10000, 11000)
]
sequences = {}
for chrom, start, end in regions:
seq_id = f"{chrom}:{start}-{end}"
sequences[seq_id] = fasta.fetch(chrom, start, end)
```
### Check N Content
```python
def has_quality_sequence(fasta, chrom, start, end, max_n_frac=0.1):
"""Check if region has acceptable N content."""
seq = fasta.fetch(chrom, start, end)
n_count = seq.upper().count('N')
return (n_count / len(seq)) <= max_n_frac
# Usage
is_good = has_quality_sequence(fasta, "chr1", 1000, 2000, max_n_frac=0.05)
```
## Working with Ambiguous Bases
FASTA files may contain IUPAC ambiguity codes:
| Code | Meaning |
|------|---------|
| N | Any base |
| R | A or G (purine) |
| Y | C or T (pyrimidine) |
| S | G or C (strong) |
| W | A or T (weak) |
| K | G or T (keto) |
| M | A or C (amino) |
| B | C, G, or T (not A) |
| D | A, G, or T (not C) |
| H | A, C, or T (not G) |
| V | A, C, or G (not T) |
```python
def count_ambiguous(sequence):
"""Count non-ATGC bases."""
return sum(1 for base in sequence.upper() if base not in "ATGC")
```
## Context Manager Usage
```python
import pysam
# Using context manager (recommended)
with pysam.FastaFile("reference.fasta") as fasta:
seq = fasta.fetch("chr1", 1000, 2000)
# File automatically closed when exiting block
```
## Performance Tips
1. **Always use indexed FASTA** - Create `.fai` with `pysam.faidx()`
2. **Batch fetch operations** - Minimize open/close cycles
3. **Cache frequently accessed sequences** - Store in memory if reused
4. **Use appropriate window sizes** - Avoid loading excessive data
5. **Close files explicitly** - Free resources when done
## Common Pitfalls
1. **Coordinate confusion** - Remember 0-based numeric vs 1-based string
2. **Missing index** - Random access requires `.fai` file
3. **Case sensitivity** - FASTA preserves case; use `.upper()` for consistent comparison
4. **Chromosome naming** - Ensure names match exactly (e.g., "chr1" vs "1")
5. **Out of bounds** - Fetching beyond sequence length returns truncated result
## Comparison with Bio.SeqIO
| Feature | pysam.FastaFile | Bio.SeqIO |
|---------|-----------------|-----------|
| Random access | Yes (with index) | Yes (with SeqIO.index) |
| Memory usage | Low (indexed) | Varies |
| Coordinate input | Numeric (0-based) | N/A (by ID only) |
| Best for | Large genomes, region queries | Format conversion, annotations |
```
### references/utilities.md
```markdown
# Sequence Utilities (Bio.SeqUtils)
## Overview
Biopython's `Bio.SeqUtils` module provides utilities for calculating sequence statistics including GC content, molecular weight, melting temperature, and codon usage.
## GC Content
### Basic GC Calculation
```python
from Bio.SeqUtils import gc_fraction
from Bio.Seq import Seq
seq = Seq("ATGCGATCGATCG")
gc = gc_fraction(seq)
print(f"GC content: {gc:.2%}") # Output: GC content: 53.85%
```
### GC Content Variants
```python
from Bio.SeqUtils import gc_fraction
# Standard GC fraction (G+C)/(A+T+G+C)
gc = gc_fraction(seq)
# For ambiguous bases, use:
from Bio.SeqUtils import GC # Deprecated but handles ambiguous
```
### GC Skew
```python
def gc_skew(seq):
"""Calculate GC skew: (G-C)/(G+C)"""
g = seq.upper().count('G')
c = seq.upper().count('C')
if g + c == 0:
return 0
return (g - c) / (g + c)
```
## Molecular Weight
### DNA Molecular Weight
```python
from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq
dna_seq = Seq("ATGCGATCGATCG")
# Single-stranded DNA
mw_ss = molecular_weight(dna_seq, seq_type="DNA")
print(f"MW (ssDNA): {mw_ss:.2f} Da")
# Double-stranded DNA
mw_ds = molecular_weight(dna_seq, seq_type="DNA", double_stranded=True)
print(f"MW (dsDNA): {mw_ds:.2f} Da")
```
### RNA Molecular Weight
```python
from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq
rna_seq = Seq("AUGCGAUCGAUCG")
mw = molecular_weight(rna_seq, seq_type="RNA")
print(f"MW (RNA): {mw:.2f} Da")
```
### Protein Molecular Weight
```python
from Bio.SeqUtils import molecular_weight
from Bio.Seq import Seq
protein_seq = Seq("MRHILK")
mw = molecular_weight(protein_seq, seq_type="protein")
print(f"MW (protein): {mw:.2f} Da")
```
## Melting Temperature (Tm)
### Basic Tm Calculation
```python
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Seq import Seq
seq = Seq("ATGCGATCGATCG")
# Wallace rule (simple, for short oligos)
tm_wallace = mt.Tm_Wallace(seq)
# GC method
tm_gc = mt.Tm_GC(seq)
# Nearest-neighbor method (most accurate)
tm_nn = mt.Tm_NN(seq)
print(f"Tm (Wallace): {tm_wallace:.1f}°C")
print(f"Tm (GC): {tm_gc:.1f}°C")
print(f"Tm (NN): {tm_nn:.1f}°C")
```
### Nearest-Neighbor with Custom Parameters
```python
from Bio.SeqUtils import MeltingTemp as mt
from Bio.Seq import Seq
seq = Seq("ATGCGATCGATCG")
# With salt concentration
tm = mt.Tm_NN(seq, Na=50, K=0, Tris=0, Mg=0, dNTPs=0)
# With primer concentration
tm = mt.Tm_NN(seq, dnac1=250, dnac2=0) # nM concentrations
print(f"Tm: {tm:.1f}°C")
```
### Tm Methods Comparison
| Method | Use Case | Accuracy |
|--------|----------|----------|
| `Tm_Wallace` | Quick estimate, short oligos (<14 bp) | Low |
| `Tm_GC` | General purpose, medium oligos | Medium |
| `Tm_NN` | PCR primers, accurate calculation | High |
## Codon Usage
### Count Codons
```python
from Bio.SeqUtils import CodonUsage
from Bio.Seq import Seq
# Create codon usage index
cds = Seq("ATGAAACCCGGGTTTTAA")
codon_count = {}
for i in range(0, len(cds) - 2, 3):
codon = str(cds[i:i+3])
codon_count[codon] = codon_count.get(codon, 0) + 1
print(codon_count)
```
### Codon Adaptation Index (CAI)
```python
from Bio.SeqUtils.CodonUsage import CodonAdaptationIndex
from Bio.Seq import Seq
# Calculate CAI (requires reference codon usage)
cai = CodonAdaptationIndex()
# Generate index from reference sequences
# cai.generate_index("reference_genes.fasta")
# Calculate CAI for a sequence
# score = cai.cai_for_gene(str(my_gene_seq))
```
## Protein Analysis
### Isoelectric Point
```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis
protein_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNG"
analysis = ProteinAnalysis(protein_seq)
pi = analysis.isoelectric_point()
print(f"pI: {pi:.2f}")
```
### Amino Acid Composition
```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis
protein_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNG"
analysis = ProteinAnalysis(protein_seq)
# Amino acid percentage
aa_percent = analysis.get_amino_acids_percent()
for aa, pct in sorted(aa_percent.items()):
if pct > 0:
print(f"{aa}: {pct:.1%}")
```
### Protein Properties
```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis
protein_seq = "MAEGEITTFTALTEKFNLPPGNYKKPKLLYCSNG"
analysis = ProteinAnalysis(protein_seq)
# Molecular weight
mw = analysis.molecular_weight()
print(f"MW: {mw:.2f} Da")
# Aromaticity
aromaticity = analysis.aromaticity()
print(f"Aromaticity: {aromaticity:.3f}")
# Instability index
instability = analysis.instability_index()
print(f"Instability index: {instability:.2f}")
# GRAVY (Grand average of hydropathicity)
gravy = analysis.gravy()
print(f"GRAVY: {gravy:.3f}")
# Secondary structure fraction
helix, turn, sheet = analysis.secondary_structure_fraction()
print(f"Helix: {helix:.1%}, Turn: {turn:.1%}, Sheet: {sheet:.1%}")
```
## Restriction Analysis
### Find Restriction Sites
```python
from Bio.Restriction import RestrictionBatch, EcoRI, BamHI, HindIII
from Bio.Seq import Seq
seq = Seq("ATGAATTCGATCGGATCCAAGCTTGCATGC")
# Single enzyme
sites = EcoRI.search(seq)
print(f"EcoRI sites: {sites}")
# Multiple enzymes
rb = RestrictionBatch([EcoRI, BamHI, HindIII])
result = rb.search(seq)
for enzyme, positions in result.items():
if positions:
print(f"{enzyme}: {positions}")
```
### Common Restriction Enzymes
| Enzyme | Recognition Sequence |
|--------|---------------------|
| EcoRI | G^AATTC |
| BamHI | G^GATCC |
| HindIII | A^AGCTT |
| NotI | GC^GGCCGC |
| XhoI | C^TCGAG |
## Sequence Statistics Summary
```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight
from Bio.SeqUtils import MeltingTemp as mt
def sequence_stats(seq_str):
"""Calculate common sequence statistics."""
seq = Seq(seq_str)
return {
"length": len(seq),
"gc_content": gc_fraction(seq),
"molecular_weight": molecular_weight(seq, seq_type="DNA"),
"tm_nn": mt.Tm_NN(seq),
"a_count": seq.count("A"),
"t_count": seq.count("T"),
"g_count": seq.count("G"),
"c_count": seq.count("C"),
}
# Usage
stats = sequence_stats("ATGCGATCGATCG")
for key, value in stats.items():
print(f"{key}: {value}")
```
```
### references/formats.md
```markdown
# Sequence File Formats
## FASTA Format
### Structure
```
>sequence_id description
SEQUENCE_LINE_1
SEQUENCE_LINE_2
...
```
### Example
```
>NM_001301717.2 Homo sapiens tumor protein p53 (TP53)
ATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCA
GACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATG
GATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCA
```
### Key Points
- Header line starts with `>`
- Sequence ID is the first word after `>`
- Description follows the ID (space-separated)
- Sequence lines typically 60-80 characters
- No spaces within sequence
- Multiple sequences concatenated
### Parsing with Biopython
```python
from Bio import SeqIO
for record in SeqIO.parse("file.fasta", "fasta"):
print(f"ID: {record.id}")
print(f"Description: {record.description}")
print(f"Sequence: {record.seq[:50]}...")
```
## GenBank Format
### Structure
```
LOCUS identifier length bp type division date
DEFINITION description
ACCESSION accession_number
VERSION version
KEYWORDS keywords
SOURCE organism
ORGANISM full taxonomy
FEATURES Location/Qualifiers
source 1..length
/organism="..."
/mol_type="..."
gene start..end
/gene="gene_name"
CDS start..end
/gene="gene_name"
/product="protein_name"
/translation="PROTEIN_SEQUENCE"
ORIGIN
1 atgcgatcga tcgatcgatc gatcgatcga tcgatcgatc gatcgatcga
61 tcgatcgatc gatcgatcga tcgatcgatc
//
```
### Key Sections
| Section | Description |
|---------|-------------|
| LOCUS | Identifier, length, type, date |
| DEFINITION | Sequence description |
| ACCESSION | Database accession number |
| VERSION | Accession with version |
| FEATURES | Annotations (genes, CDS, etc.) |
| ORIGIN | The actual sequence |
### Feature Types
| Type | Description |
|------|-------------|
| source | Full sequence source info |
| gene | Gene region |
| CDS | Coding sequence |
| mRNA | Messenger RNA |
| exon | Exon region |
| intron | Intron region |
| misc_feature | Other features |
### Parsing with Biopython
```python
from Bio import SeqIO
record = SeqIO.read("file.gb", "genbank")
# Access annotations
print(record.annotations["organism"])
print(record.annotations["taxonomy"])
# Access features
for feature in record.features:
if feature.type == "CDS":
print(f"CDS: {feature.location}")
if "product" in feature.qualifiers:
print(f" Product: {feature.qualifiers['product'][0]}")
```
## FASTQ Format
### Structure
```
@sequence_id description
SEQUENCE
+
QUALITY_SCORES
```
### Example
```
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=72
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACCAAGTTACCCTTAACAACTTAAGGGTTTTCAAATAGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9ICIIIIIIIIIIIIIIIIDIIIIIII>IIIIII/
```
### Quality Score Encoding
**Phred+33 (Sanger/Illumina 1.8+):**
- ASCII 33 (`!`) = Q0
- ASCII 43 (`+`) = Q10
- ASCII 53 (`5`) = Q20
- ASCII 63 (`?`) = Q30
- ASCII 73 (`I`) = Q40
**Quality Score Meaning:**
- Q10 = 10% error rate (1 in 10)
- Q20 = 1% error rate (1 in 100)
- Q30 = 0.1% error rate (1 in 1000)
- Q40 = 0.01% error rate (1 in 10000)
### Parsing with Biopython
```python
from Bio import SeqIO
for record in SeqIO.parse("file.fastq", "fastq"):
print(f"ID: {record.id}")
print(f"Sequence: {record.seq}")
print(f"Quality: {record.letter_annotations['phred_quality']}")
```
### Parsing with pysam
```python
import pysam
for read in pysam.FastxFile("file.fastq"):
print(f"Name: {read.name}")
print(f"Sequence: {read.sequence}")
print(f"Quality: {read.quality}")
print(f"Quality array: {read.get_quality_array()}")
```
## Format Detection
### By Extension
| Extension | Format |
|-----------|--------|
| `.fasta`, `.fa`, `.fna`, `.faa` | FASTA |
| `.fastq`, `.fq` | FASTQ |
| `.gb`, `.gbk`, `.genbank` | GenBank |
| `.embl` | EMBL |
### By Content
```python
def detect_format(filepath):
"""Detect sequence file format from content."""
with open(filepath, 'r') as f:
first_char = f.read(1)
if first_char == '>':
return 'fasta'
elif first_char == '@':
return 'fastq'
elif first_char == 'L': # LOCUS
return 'genbank'
else:
return 'unknown'
```
## Format Conversion
### Common Conversions
```python
from Bio import SeqIO
# GenBank to FASTA
SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
# FASTQ to FASTA (loses quality)
SeqIO.convert("input.fastq", "fastq", "output.fasta", "fasta")
# FASTA to tab-delimited
SeqIO.convert("input.fasta", "fasta", "output.tab", "tab")
```
### Preserve Annotations
```python
from Bio import SeqIO
# Read with full annotations
records = list(SeqIO.parse("input.gb", "genbank"))
# Write to EMBL (preserves more than FASTA)
SeqIO.write(records, "output.embl", "embl")
```
## Compressed Files
### Reading Compressed Files
```python
import gzip
from Bio import SeqIO
# GZIP compressed
with gzip.open("file.fasta.gz", "rt") as handle:
for record in SeqIO.parse(handle, "fasta"):
print(record.id)
```
### With pysam
```python
import pysam
# pysam handles compression automatically
for read in pysam.FastxFile("file.fastq.gz"):
print(read.name)
```
## Validation
### Check FASTA Validity
```python
def validate_fasta(filepath):
"""Basic FASTA validation."""
from Bio import SeqIO
try:
records = list(SeqIO.parse(filepath, "fasta"))
if not records:
return False, "No records found"
for record in records:
if not record.id:
return False, "Record without ID"
if not record.seq:
return False, f"Empty sequence: {record.id}"
return True, f"Valid FASTA with {len(records)} records"
except Exception as e:
return False, str(e)
```
### Check FASTQ Validity
```python
def validate_fastq(filepath):
"""Basic FASTQ validation."""
from Bio import SeqIO
try:
count = 0
for record in SeqIO.parse(filepath, "fastq"):
count += 1
seq_len = len(record.seq)
qual_len = len(record.letter_annotations["phred_quality"])
if seq_len != qual_len:
return False, f"Length mismatch in {record.id}"
return True, f"Valid FASTQ with {count} records"
except Exception as e:
return False, str(e)
```
```