
python-json-parsing

Python JSON parsing best practices covering performance optimization (orjson/msgspec), handling large files (streaming/JSONL), security (injection prevention), and advanced querying (JSONPath/JMESPath). Use when working with JSON data, parsing APIs, handling large JSON files, or optimizing JSON performance.

Packaged view

This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.

Stars: 781
Hot score: 99
Updated: March 20, 2026
Overall rating: C (4.8)
Composite score: 4.8
Best-practice grade: B (80.4)

Install command

npx @skill-hub/cli install benchflow-ai-skillsbench-python-json-parsing

Repository

benchflow-ai/SkillsBench

Skill path: registry/terminal_bench_1.0/jsonl-aggregator/environment/skills/python-json-parsing


Open repository

Best for

Primary workflow: Analyze Data & AI.

Technical facets: Full Stack, Data / AI, Security.

Target audience: everyone.

License: Unknown.

Original source

Catalog source: SkillHub Club.

Repository owner: benchflow-ai.

This is a mirrored public skill entry. Review the repository before installing it into production workflows.

What it helps with

  • Install python-json-parsing into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
  • Review https://github.com/benchflow-ai/SkillsBench before adding python-json-parsing to shared team environments
  • Use python-json-parsing for development workflows

Works across

Claude Code · Codex CLI · Gemini CLI · OpenCode

Favorites: 0.

Sub-skills: 0.

Aggregator: No.

Original source / Raw SKILL.md

---
name: python-json-parsing
description: >
  Python JSON parsing best practices covering performance optimization (orjson/msgspec),
  handling large files (streaming/JSONL), security (injection prevention),
  and advanced querying (JSONPath/JMESPath).
  Use when working with JSON data, parsing APIs, handling large JSON files,
  or optimizing JSON performance.
---

# Python JSON Parsing Best Practices

Comprehensive guide to JSON parsing in Python with focus on performance, security, and scalability.

## Quick Start

### Basic JSON Parsing

```python
import json

# Parse JSON string
data = json.loads('{"name": "Alice", "age": 30}')

# Parse JSON file
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Write JSON file
with open("output.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
```

**Key Rule:** Always specify `encoding="utf-8"` when reading/writing files.

## When to Use This Skill

Use this skill when:

- Working with JSON APIs or data interchange
- Optimizing JSON performance in high-throughput applications
- Handling large JSON files (> 100MB)
- Securing applications against JSON injection
- Extracting data from complex nested JSON structures

## Performance: Choose the Right Library

### Library Comparison (10,000 records benchmark)

| Library | Serialize (s) | Deserialize (s) | Best For |
|---------|---------------|-----------------|----------|
| **orjson** | 0.42 | 1.27 | FastAPI, web APIs (3.9x faster) |
| **msgspec** | 0.49 | 0.93 | Maximum performance (1.7x faster deserialization) |
| **json** (stdlib) | 1.62 | 1.62 | Universal compatibility |
| **ujson** | 1.41 | 1.85 | Drop-in replacement (faster serialization, but slower deserialization than stdlib in this benchmark) |

**Recommendation:**

- Use **orjson** for FastAPI/web APIs (native support, fastest serialization)
- Use **msgspec** for data pipelines (fastest overall, typed validation)
- Use **json** when compatibility is critical
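
Neither orjson nor msgspec is a drop-in `json` replacement: orjson works in bytes, and msgspec decodes into typed structs. A minimal sketch of the three APIs side by side (assumes `pip install orjson msgspec`):

```python
import json

import msgspec
import orjson

record = {"name": "Alice", "age": 30}

# stdlib: str in, str out
assert json.loads(json.dumps(record)) == record

# orjson: returns and accepts bytes (fastest serialization)
payload: bytes = orjson.dumps(record)
assert orjson.loads(payload) == record

# msgspec: typed decoding validates while it parses
class User(msgspec.Struct):
    name: str
    age: int

user = msgspec.json.decode(msgspec.json.encode(record), type=User)
assert user.name == "Alice" and user.age == 30
```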

### Installation

```bash
# High-performance libraries
pip install orjson msgspec ujson

# Advanced querying
pip install jsonpath-ng jmespath

# Streaming large files
pip install ijson

# Schema validation
pip install jsonschema
```

## Large Files: Streaming Strategies

For files > 100MB, avoid loading into memory.

### Strategy 1: JSONL (JSON Lines)

Convert large JSON arrays to line-delimited format:

```python
# Stream process JSONL
with open("large.jsonl", "r") as infile, open("output.jsonl", "w") as outfile:
    for line in infile:
        obj = json.loads(line)
        obj["processed"] = True
        outfile.write(json.dumps(obj) + "\n")
```
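
The snippet above assumes the data is already line-delimited. To get there from a single large JSON array without loading it, stream the conversion; a minimal sketch using ijson (file names are illustrative):

```python
import json

import ijson

# Stream a top-level JSON array into JSONL, one object at a time
with open("large.json", "rb") as infile, \
     open("large.jsonl", "w", encoding="utf-8") as outfile:
    # "item" matches each element of the top-level array
    for obj in ijson.items(infile, "item"):
        outfile.write(json.dumps(obj) + "\n")
```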

### Strategy 2: Streaming with ijson

```python
import ijson

# Process large JSON without loading into memory
with open("huge.json", "rb") as f:
    for item in ijson.items(f, "products.item"):
        process(item)  # Handle one item at a time
```

See: `patterns/streaming-large-json.md`

## Security: Prevent JSON Injection

**Critical Rules:**

1. Always use `json.loads()`, never `eval()`
2. Validate input with `jsonschema`
3. Sanitize user input before serialization
4. Escape special characters (`"` and `\`)

**Vulnerable Code:**

```python
# NEVER DO THIS
username = request.GET['username']  # User input: admin", "role": "admin
json_string = f'{{"user":"{username}","role":"user"}}'
# Result: {"user":"admin", "role": "admin","role":"user"} - duplicate keys;
# which value wins is parser-dependent, enabling privilege escalation
```

**Secure Code:**

```python
# Use json.dumps for serialization
data = {"user": username, "role": "user"}
json_string = json.dumps(data)  # Properly escaped
```

See: `anti-patterns/security-json-injection.md`, `anti-patterns/eval-usage.md`
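
Rule 2 deserves its own snippet: validate the parsed structure before trusting it. A minimal sketch with the `jsonschema` library (the schema fields here are illustrative):

```python
from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "required": ["user", "role"],
    "properties": {
        "user": {"type": "string", "maxLength": 64},
        "role": {"type": "string", "enum": ["user", "admin"]},
    },
}

try:
    validate(instance={"user": "alice", "role": "user"}, schema=schema)
except ValidationError as e:
    print(f"Rejected: {e.message}")
```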

## Advanced: JSONPath for Complex Queries

Extract data from nested JSON without complex loops:

```python
# Filter expressions need the extended parser; the base
# jsonpath_ng parser rejects the [?...] syntax
from jsonpath_ng.ext import parse

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25}
    ]
}

# Filter by price
query = parse("products[?price > 20].name")
results = [match.value for match in query.find(data)]
# Output: ["Peach"]
```

**Key Operators:**

- `$` - Root selector
- `..` - Recursive descendant
- `*` - Wildcard
- `[?<predicate>]` - Filter (e.g., `[?price > 20]`)
- `[start:end:step]` - Array slicing
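
The recursive-descent operator is the one that saves the most loop code; a short self-contained sketch with the base parser:

```python
from jsonpath_ng import parse

data = {"store": {"products": [{"price": 12.88}, {"price": 27.25}]}}

# "$..price" matches every "price" key at any depth
prices = [match.value for match in parse("$..price").find(data)]
# Output: [12.88, 27.25]
```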

See: `patterns/jsonpath-querying.md`

## Custom Objects: Serialization

Handle datetime, UUID, Decimal, and custom classes:

```python
from datetime import datetime
import json

class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)

# Usage
data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
json_str = json.dumps(data, cls=CustomEncoder)
```

See: `patterns/custom-object-serialization.md`

## Performance Checklist

- [ ] Use orjson/msgspec for high-throughput applications
- [ ] Specify UTF-8 encoding when reading/writing files
- [ ] Use streaming (ijson/JSONL) for files > 100MB
- [ ] Minify JSON for production (`separators=(',', ':')`)
- [ ] Pretty-print for development (`indent=2`)

## Security Checklist

- [ ] Never use `eval()` for JSON parsing
- [ ] Validate input with `jsonschema`
- [ ] Sanitize user input before serialization
- [ ] Use `json.dumps()` to prevent injection
- [ ] Escape special characters in user data

## Reference Documentation

**Performance:**

- `reference/python-json-parsing-best-practices-2025.md` - Comprehensive research with benchmarks

**Patterns:**

- `patterns/streaming-large-json.md` - ijson and JSONL strategies
- `patterns/custom-object-serialization.md` - Handle datetime, UUID, custom classes
- `patterns/jsonpath-querying.md` - Advanced nested data extraction

**Security:**

- `anti-patterns/security-json-injection.md` - Prevent injection attacks
- `anti-patterns/eval-usage.md` - Why never to use eval()

**Examples:**

- `examples/high-performance-parsing.py` - orjson and msgspec code
- `examples/large-file-streaming.py` - Streaming with ijson
- `examples/secure-validation.py` - jsonschema validation

**Tools:**

- `tools/json-performance-benchmark.py` - Benchmark different libraries


---

## Referenced Files

> The following files are referenced in this skill and included for context.

### reference/python-json-parsing-best-practices-2025.md

```markdown
# Python JSON Parsing Best Practices (2025)

**Research Date:** October 31, 2025
**Query:** Best practices for parsing JSON in Python 2025

## Executive Summary

JSON parsing in Python has evolved significantly with performance-optimized libraries and enhanced security practices. This research identifies critical best practices for developers working with JSON data in 2025, covering library selection, performance optimization, security considerations, and handling large-scale datasets.

---

## 1. Core Library Selection & Performance

### Standard Library (`json` module)

The built-in `json` module remains the baseline for JSON operations in Python:

- **Serialization**: `json.dumps()` converts Python objects to JSON strings
- **Deserialization**: `json.loads()` parses JSON strings to Python objects
- **File Operations**: `json.dump()` and `json.load()` for direct file I/O
- **Performance**: Adequate for most use cases but slower than alternatives

**Key Insight**: Always specify encoding when working with files - UTF-8 is the recommended standard per RFC requirements.

```python
import json

# Best practice: Always specify encoding
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)
```

**Source**: [Real Python - Working With JSON Data in Python](https://realpython.com/python-json)

### High-Performance Alternatives (2025 Benchmarks)

Based on comprehensive benchmarking of 10,000 records with 10 runs:

| Library | Serialization (s) | Deserialization (s) | Key Features |
|---------|-------------------|---------------------|--------------|
| **orjson** | 0.417962 | 1.272813 | Rust-based, fastest serialization, built-in FastAPI support |
| **msgspec** | 0.489964 | 0.930834 | Ultra-fast, typed structs, supports YAML/TOML |
| **json** (stdlib) | 1.616786 | 1.616203 | Universal compatibility, stable |
| **ujson** | 1.413367 | 1.853332 | C-based, drop-in replacement |
| **rapidjson** | 2.044958 | 1.717067 | C++ wrapper, flexible |

**Recommendation**:
- Use **orjson** for web APIs (FastAPI native support, 3.9x faster serialization)
- Use **msgspec** for maximum performance across all operations (1.7x faster deserialization)
- Stick with **json** for compatibility-critical applications

**Source**: [DEV Community - Benchmarking Python JSON Libraries](https://dev.to/kanakos01/benchmarking-python-json-libraries-33bb)

---

## 2. Handling Large JSON Files

### Problem: Memory Constraints

Loading multi-million line JSON files with `json.load()` causes memory exhaustion.

### Solution Strategies

#### Strategy 1: JSON Lines (JSONL) Format

Convert large JSON arrays to line-delimited format for streaming processing:

```python
# Streaming read + process + write
with open("big.jsonl", "r") as infile, open("new.jsonl", "w") as outfile:
    for line in infile:
        obj = json.loads(line)
        obj["status"] = "processed"
        outfile.write(json.dumps(obj) + "\n")
```

**Benefits**:
- Easy appending of new records
- Line-by-line updates without rewriting entire file
- Native support in pandas, Spark, and `jq`

**Source**: [DEV Community - Handling Large JSON Files in Python](https://dev.to/lovestaco/handling-large-json-files-in-python-efficient-read-write-and-update-strategies-3jgg)

#### Strategy 2: Incremental Parsing with `ijson`

For true JSON arrays/objects, use streaming parsers:

```python
import ijson

# Process large file without loading into memory
with open("huge.json", "rb") as f:
    for item in ijson.items(f, "products.item"):
        process(item)  # Handle one item at a time
```

#### Strategy 3: Database Migration

For frequently queried/updated data, migrate from JSON to:
- **SQLite**: Lightweight, file-based
- **PostgreSQL/MongoDB**: Scalable solutions

**Critical Decision Matrix**:
- JSON additions only → JSONL format
- Batch updates → Stream read + rewrite
- Frequent random updates → Database

**Source**: [DEV Community - Handling Large JSON Files](https://dev.to/lovestaco/handling-large-json-files-in-python-efficient-read-write-and-update-strategies-3jgg)
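
For the database route, stdlib tooling already goes a long way; a minimal sketch loading JSONL into SQLite (table and file names are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, body TEXT)")

with open("big.jsonl", "r", encoding="utf-8") as f:
    # Re-serializing each line validates it before insertion
    rows = ((json.dumps(json.loads(line)),) for line in f)
    conn.executemany("INSERT INTO records (body) VALUES (?)", rows)
conn.commit()

# json_extract needs SQLite's JSON functions (bundled in modern builds)
for (name,) in conn.execute("SELECT json_extract(body, '$.name') FROM records"):
    print(name)
```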

---

## 3. Advanced Parsing with JSONPath & JMESPath

### JSONPath: XPath for JSON

Use JSONPath for nested data extraction with complex queries:

```python
# Filter expressions need the extended parser; the base
# jsonpath_ng parser rejects the [?...] syntax
from jsonpath_ng.ext import parse

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25}
    ]
}

# Filter by price
query = parse("products[?price > 20].name")
results = [match.value for match in query.find(data)]
# Output: ["Peach"]
```

**Key Operators**:
- `$` - Root selector
- `..` - Recursive descendant
- `*` - Wildcard
- `[?<predicate>]` - Filter (e.g., `[?price > 20 & price < 100]`)
- `[start:end:step]` - Array slicing

**Use Cases**:
- Web scraping hidden JSON data in `<script>` tags
- Extracting nested API response data
- Complex filtering across multiple levels

**Source**: [ScrapFly - Introduction to Parsing JSON with Python JSONPath](https://scrapfly.io/blog/posts/parse-json-jsonpath-python)

### JMESPath Alternative

JMESPath offers easier dataset mutation and filtering for predictable structures, while JSONPath excels at extracting deeply nested data.
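
The same price filter from the JSONPath example, expressed in JMESPath (note that numeric literals are backtick-quoted in JMESPath expressions):

```python
import jmespath

data = {
    "products": [
        {"name": "Apple", "price": 12.88},
        {"name": "Peach", "price": 27.25}
    ]
}

names = jmespath.search("products[?price > `20`].name", data)
# Output: ['Peach']
```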

---

## 4. Security Best Practices

### Critical Vulnerabilities

#### JSON Injection Attacks

**Server-side injection** occurs when unsanitized user input is directly serialized:

```python
# VULNERABLE CODE
username = request.GET['username']  # User input: admin", "role": "administrator
json_string = f'{{"user":"{username}","role":"user"}}'
# Result: {"user":"admin", "role": "administrator","role":"user"}
# Duplicate keys: which value wins is parser-dependent (Python's json keeps
# the last), so any consumer that keeps the first sees role=administrator
# → privilege escalation
```

**Client-side injection** via `eval()`:

```python
# NEVER DO THIS
data = eval("(" + json_response + ")")  # Code execution risk!

# CORRECT APPROACH
data = json.loads(json_response)  # Safe parsing
```

**Source**: [Comparitech - JSON Injection Guide](https://www.comparitech.com/net-admin/json-injection-guide)

### Defense Strategies

1. **Input Sanitization**: Validate and escape all user input before serialization
2. **Never Use `eval()`**: Always use `json.loads()` or `JSON.parse()`
3. **Schema Validation**: Use `jsonschema` library to enforce data contracts
4. **Content Security Policy (CSP)**: Prevents eval() usage by default
5. **Escape Special Characters**: Properly escape `"` and `\` in user data

```python
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "required": ["id", "name", "email"],
    "properties": {
        "id": {"type": "integer", "minimum": 1},
        "email": {"type": "string", "format": "email"}
    }
}

try:
    validate(instance=user_data, schema=schema)
except ValidationError as e:
    print(f"Invalid data: {e.message}")
```

**Source**: [Better Stack - Working With JSON Data in Python](https://betterstack.com/community/guides/scaling-nodejs/json-data-in-python)

---

## 5. Type Handling & Custom Objects

### Python ↔ JSON Type Mapping

| Python (encoding) | JSON | JSON | Python (decoding) |
|-------------------|------|------|-------------------|
| dict | object | object | dict |
| list, tuple | array | array | list ⚠️ |
| str | string | string | str |
| int, float | number | number | int/float |
| True/False | true/false | true/false | True/False |
| None | null | null | None |

**⚠️ Gotcha**: Tuples serialize to arrays but deserialize back to lists (data type loss).
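
A two-line round trip makes the gotcha concrete:

```python
import json

point = (1, 2)
restored = json.loads(json.dumps(point))
print(restored, type(restored))  # [1, 2] <class 'list'>
```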

### Custom Object Serialization

```python
from datetime import datetime
import json

class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)

# Usage
data = {"timestamp": datetime.now(), "tags": {"python", "json"}}
json_str = json.dumps(data, cls=CustomEncoder)
```

**Advanced Alternative**: Use **Pydantic** or **msgspec** for typed validation and automatic serialization.

**Source**: [Better Stack Community Guide](https://betterstack.com/community/guides/scaling-nodejs/json-data-in-python)

---

## 6. Formatting & Debugging

### Pretty Printing

```python
# Readable output with indentation
json_str = json.dumps(data, indent=2, sort_keys=True)
```

Command-line validation and formatting:

```bash
# Validate JSON file
python -m json.tool config.json

# Pretty-print to new file
python -m json.tool input.json output.json --indent 2
```

### Minification for Production

```python
# Remove all whitespace for minimal size
minified = json.dumps(data, separators=(',', ':'))
```

```bash
# Command line
python -m json.tool --compact input.json output.json
```

**Performance Impact**: Pretty-printed JSON can be 2x larger (308 bytes → 645 bytes in benchmarks).

**Source**: [Real Python - Working With JSON](https://realpython.com/python-json)

---

## 7. Web Scraping: Handling Non-Standard JSON

### ChompJS for JavaScript Objects

Many websites embed data in JavaScript objects that aren't valid JSON:

```python
import chompjs

# These are valid JS but invalid JSON:
js_objects = [
    "{'a': 'b'}",           # Single quotes
    "{a: 'b'}",             # Unquoted keys
    '{"a": [1,2,3,]}',      # Trailing comma
    '{"price": .99}'        # Missing leading zero
]

# ChompJS handles all of these
for js in js_objects:
    python_dict = chompjs.parse_js_object(js)
```

**Use Case**: Extracting hidden web data from `<script>` tags containing JavaScript initializers.

**Source**: [Zyte - JSON Parsing with Python](https://www.zyte.com/blog/json-parsing-with-python)

---

## 8. Production Optimization Checklist

### High-Performance Applications

Based on LinkedIn engineering insights for million-request APIs:

1. **Library Selection**:
   - FastAPI apps → `orjson` (native support, 4x faster)
   - Data pipelines → `msgspec` (fastest overall)
   - General use → `ujson` (2x faster than stdlib)

2. **Streaming for Scale**:
   - Use `ijson` for files > 100MB
   - Convert to JSONL for append-heavy workloads
   - Consider Protocol Buffers for ultra-high performance

3. **Buffer Optimization**:
   - Profile with `cProfile` to identify bottlenecks
   - Tune I/O buffer sizes for your workload
   - Use async I/O for concurrent request handling

4. **Monitoring**:
   - Track JSON processing time metrics
   - Set up alerts for parsing errors
   - Continuously benchmark against SLAs

**Source**: [LinkedIn - Optimizing JSON Parsing and Serialization](https://linkedin.com/pulse/optimizing-json-parsing-serialization-applications-amit-jindal-1g0tf)
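
For the profiling advice in point 3, nothing beyond the stdlib is needed; a minimal sketch (the payload here is synthetic):

```python
import cProfile
import json

# Synthetic parse-heavy workload
payload = json.dumps([{"i": i} for i in range(100_000)])

# Run at module top level so `payload` is visible to the profiled statement
cProfile.run("json.loads(payload)", sort="cumtime")
```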

---

## 9. Logging Best Practices

### Structured JSON Logging

```python
import logging
import json

# Configure JSON logging from the start
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "user_id": getattr(record, 'user_id', None),
            "session_id": getattr(record, 'session_id', None)
        }
        return json.dumps(log_data)

# Benefits:
# - Easy parsing and searching
# - Structured database storage
# - Better correlation across services
```

**Schema Design Tips**:
- Use consistent key naming (snake_case recommended)
- Flatten structures when possible (concatenate keys with separator)
- Uniform data types per field
- Parse stack traces into hierarchical attributes

**Source**: [Graylog - What To Know About Parsing JSON](https://graylog.org/post/what-to-know-parsing-json)

---

## 10. Key Takeaways for Developers

### Must-Do Practices

1. ✅ **Always use `json.loads()`, never `eval()`** for security
2. ✅ **Specify UTF-8 encoding** when reading/writing files
3. ✅ **Validate input** with `jsonschema` before processing
4. ✅ **Choose performance library** based on workload (orjson/msgspec)
5. ✅ **Use JSONL format** for large, append-heavy datasets
6. ✅ **Implement streaming** for files > 100MB
7. ✅ **Pretty-print for development**, minify for production
8. ✅ **Add structured logging** from project start

### Common Pitfalls to Avoid

1. ❌ Loading entire large files into memory
2. ❌ Using `eval()` for JSON parsing
3. ❌ Skipping input validation on user data
4. ❌ Ignoring type conversions (tuple → list)
5. ❌ Not handling exceptions properly
6. ❌ Over-logging during parsing (performance impact)
7. ❌ Using sequential IDs (security risk - use UUID/GUID)

---

## References

1. [Real Python - Working With JSON Data in Python](https://realpython.com/python-json) - Comprehensive guide to json module, Aug 2025
2. [Better Stack Community - JSON Data in Python](https://betterstack.com/community/guides/scaling-nodejs/json-data-in-python) - Advanced techniques and validation, Apr 2025
3. [DEV Community - Handling Large JSON Files](https://dev.to/lovestaco/handling-large-json-files-in-python-efficient-read-write-and-update-strategies-3jgg) - Strategies for massive datasets, Oct 2025
4. [ScrapFly - JSONPath in Python](https://scrapfly.io/blog/posts/parse-json-jsonpath-python) - Advanced querying techniques, Sep 2025
5. [DEV Community - Benchmarking JSON Libraries](https://dev.to/kanakos01/benchmarking-python-json-libraries-33bb) - Performance comparison, Jul 2025
6. [LinkedIn - Optimizing JSON for High-Performance](https://linkedin.com/pulse/optimizing-json-parsing-serialization-applications-amit-jindal-1g0tf) - Enterprise optimization, Mar 2025
7. [Graylog - Parsing JSON](https://graylog.org/post/what-to-know-parsing-json) - Logging best practices, Mar 2025
8. [Comparitech - JSON Injection Guide](https://www.comparitech.com/net-admin/json-injection-guide) - Security vulnerabilities, Nov 2024
9. [Zyte - JSON Parsing with Python](https://www.zyte.com/blog/json-parsing-with-python) - Practical guide, Dec 2024

---

**Document Version:** 1.0
**Last Updated:** October 31, 2025
**Maintained By:** Lunar Claude Research Team

```
