optimizing-sql
Optimize SQL query performance through EXPLAIN analysis, indexing strategies, and query rewriting for PostgreSQL, MySQL, and SQL Server. Use when debugging slow queries, analyzing execution plans, or improving database performance.
Packaged view
This page reorganizes the original catalog entry around fit, installability, and workflow context first. The original raw source lives below.
Install command
npx @skill-hub/cli install ancoleman-ai-design-components-optimizing-sql
Repository
Skill path: skills/optimizing-sql
Best for
Primary workflow: Ship Full Stack.
Technical facets: Full Stack, Backend.
Target audience: everyone.
License: Unknown.
Original source
Catalog source: SkillHub Club.
Repository owner: ancoleman.
This is a mirrored public skill entry. Review the repository before installing it into production workflows.
What it helps with
- Install optimizing-sql into Claude Code, Codex CLI, Gemini CLI, or OpenCode workflows
- Review https://github.com/ancoleman/ai-design-components before adding optimizing-sql to shared team environments
- Use optimizing-sql for development workflows
Works across
Favorites: 0.
Sub-skills: 0.
Aggregator: No.
Original source / Raw SKILL.md
---
name: optimizing-sql
description: Optimize SQL query performance through EXPLAIN analysis, indexing strategies, and query rewriting for PostgreSQL, MySQL, and SQL Server. Use when debugging slow queries, analyzing execution plans, or improving database performance.
---

# SQL Optimization

Provide tactical guidance for optimizing SQL query performance across PostgreSQL, MySQL, and SQL Server through execution plan analysis, strategic indexing, and query rewriting.

## When to Use This Skill

Trigger this skill when encountering:

- Slow query performance or database timeouts
- Analyzing EXPLAIN plans or execution plans
- Determining index requirements
- Rewriting inefficient queries
- Identifying query anti-patterns (N+1, SELECT *, correlated subqueries)
- Database-specific optimization needs (PostgreSQL, MySQL, SQL Server)

## Core Optimization Workflow

### Step 1: Analyze Query Performance

Run execution plan analysis to identify bottlenecks:

**PostgreSQL:**

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';
```

**MySQL:**

```sql
EXPLAIN FORMAT=JSON SELECT * FROM products WHERE category_id = 5;
```

**SQL Server:** Use SQL Server Management Studio: Display Estimated Execution Plan (Ctrl+L)

**Key Metrics to Monitor:**

- **Cost**: Estimated resource consumption
- **Rows**: Number of rows processed (estimated vs actual)
- **Scan Type**: Sequential scan vs index scan
- **Execution Time**: Actual time spent on operation

For detailed execution plan interpretation, see `references/explain-guide.md`.
### Step 2: Identify Optimization Opportunities

**Common Red Flags:**

| Indicator | Problem | Solution |
|-----------|---------|----------|
| Seq Scan / Table Scan | Full table scan on large table | Add index on filter columns |
| High row count | Processing excessive rows | Add WHERE filter or index |
| Nested Loop with large outer table | Inefficient join algorithm | Index join columns |
| Correlated subquery | Subquery executes per row | Rewrite as JOIN or EXISTS |
| Sort operation on large result set | Expensive sorting | Add index matching ORDER BY |

For scan type interpretation, see `references/scan-types.md`.

### Step 3: Apply Indexing Strategies

**Index Decision Framework:**

```
Is column used in WHERE, JOIN, ORDER BY, or GROUP BY?
├─ YES → Is column selective (many unique values)?
│   ├─ YES → Is table frequently queried?
│   │   ├─ YES → ADD INDEX
│   │   └─ NO → Consider based on query frequency
│   └─ NO (low selectivity) → Skip index
└─ NO → Skip index
```

**Index Types by Use Case:**

**PostgreSQL:**

- **B-tree** (default): General-purpose, supports <, ≤, =, ≥, >, BETWEEN, IN
- **Hash**: Equality comparisons only (=)
- **GIN**: Full-text search, JSONB, arrays
- **GiST**: Spatial data, geometric types
- **BRIN**: Very large tables with naturally ordered data

**MySQL:**

- **B-tree** (default): General-purpose index
- **Full-text**: Text search on VARCHAR/TEXT columns
- **Spatial**: Spatial data types

**SQL Server:**

- **Clustered**: Table data sorted by index (one per table)
- **Non-clustered**: Separate index structure (multiple allowed)

For comprehensive indexing guidance, see `references/indexing-decisions.md` and `references/index-types.md`.

### Step 4: Design Composite Indexes

For queries filtering on multiple columns, use composite indexes:

**Column Order Matters:**

1. **Equality filters first** (most selective)
2. **Additional equality filters** (by selectivity)
3. **Range filters or ORDER BY** (last)

**Example:**

```sql
-- Query pattern
SELECT * FROM orders
WHERE customer_id = 123 AND status = 'shipped'
ORDER BY created_at DESC
LIMIT 10;

-- Optimal composite index
CREATE INDEX idx_orders_customer_status_created
ON orders (customer_id, status, created_at DESC);
```

For composite index design patterns, see `references/composite-indexes.md`.

### Step 5: Rewrite Inefficient Queries

**Common Anti-Patterns to Avoid:**

**1. SELECT * (Over-fetching)**

```sql
-- ❌ Bad: Fetches all columns
SELECT * FROM users WHERE id = 1;

-- ✅ Good: Fetch only needed columns
SELECT id, name, email FROM users WHERE id = 1;
```

**2. N+1 Queries**

```sql
-- ❌ Bad: 1 + N queries
SELECT * FROM users LIMIT 100;
-- Then in loop: SELECT * FROM posts WHERE user_id = ?;

-- ✅ Good: Single JOIN
SELECT users.*, posts.id AS post_id, posts.title
FROM users
LEFT JOIN posts ON users.id = posts.user_id;
```

**3. Non-Sargable Queries** (functions on indexed columns)

```sql
-- ❌ Bad: Function prevents index usage
SELECT * FROM orders WHERE YEAR(created_at) = 2025;

-- ✅ Good: Sargable range condition
SELECT * FROM orders
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';
```

**4. Correlated Subqueries**

```sql
-- ❌ Bad: Subquery executes per row
SELECT name,
  (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id)
FROM users;

-- ✅ Good: JOIN with GROUP BY
SELECT users.name, COUNT(orders.id) AS order_count
FROM users
LEFT JOIN orders ON users.id = orders.user_id
GROUP BY users.id, users.name;
```

For complete anti-pattern reference, see `references/anti-patterns.md`. For efficient query patterns, see `references/efficient-patterns.md`.

## Quick Reference Tables

### Index Selection Guide

| Query Pattern | Index Type | Example |
|--------------|------------|---------|
| `WHERE column = value` | Single-column B-tree | `CREATE INDEX ON table (column)` |
| `WHERE col1 = ? AND col2 = ?` | Composite B-tree | `CREATE INDEX ON table (col1, col2)` |
| `WHERE text_col LIKE '%word%'` | Full-text (GIN/Full-text) | `CREATE INDEX ON table USING GIN (to_tsvector('english', text_col))` |
| `WHERE geom && box` | Spatial (GiST) | `CREATE INDEX ON table USING GIST (geom)` |
| `WHERE json_col @> '{"key":"value"}'` | JSONB (GIN) | `CREATE INDEX ON table USING GIN (json_col)` |

### Join Optimization Checklist

- [ ] Index foreign key columns on both sides of JOIN
- [ ] Order joins starting with the table returning the fewest rows
- [ ] Use INNER JOIN when possible (more efficient than OUTER JOIN)
- [ ] Avoid joining more than 5 tables (break into CTEs or subqueries)
- [ ] Consider denormalization for frequently joined tables in read-heavy systems

### Execution Plan Performance Targets

| Scan Type | Performance | When Acceptable |
|-----------|-------------|-----------------|
| Index-Only Scan | Best | Always preferred |
| Index Scan | Excellent | Small-medium result sets |
| Bitmap Heap Scan | Good | Medium result sets (PostgreSQL) |
| Sequential Scan | Poor | Only for small tables (<1000 rows) or full table queries |
| Table Scan | Poor | Only for small tables or unavoidable full scans |

## Database-Specific Optimizations

### PostgreSQL-Specific Features

**Partial Indexes** (index a subset of rows):

```sql
CREATE INDEX idx_active_users_login
ON users (last_login)
WHERE status = 'active';
```

**Expression Indexes** (index computed values):

```sql
CREATE INDEX idx_users_email_lower
ON users (LOWER(email));
```

**Covering Indexes** (avoid heap access):

```sql
CREATE INDEX idx_users_email_covering
ON users (email) INCLUDE (id, name);
```

For comprehensive PostgreSQL optimization, see `references/postgresql.md`.
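The "is the column selective?" question in the Step 3 decision framework can be estimated directly before creating an index. A minimal sketch (PostgreSQL syntax; the `users.status` column is illustrative, not from this skill's examples):

```sql
-- Ratio of distinct values to total rows for a candidate index column.
-- Values near 1.0 = highly selective (good B-tree candidate);
-- values near 0 = few distinct values (an index rarely helps).
SELECT COUNT(DISTINCT status)::numeric / NULLIF(COUNT(*), 0) AS selectivity
FROM users;
```

As a rough heuristic, columns with very low selectivity (e.g. a boolean flag) are better served by a partial index on the rare value than by a plain B-tree.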
### MySQL-Specific Features

**Index Hints** (override optimizer):

```sql
SELECT * FROM orders USE INDEX (idx_orders_customer)
WHERE customer_id = 123;
```

**Storage Engine Selection:**

- **InnoDB** (default): Transactional, row-level locks, clustered primary key
- **MyISAM**: Faster reads, no transactions, table-level locks

For comprehensive MySQL optimization, see `references/mysql.md`.

### SQL Server-Specific Features

**Query Store** (track query performance over time):

```sql
ALTER DATABASE YourDatabase SET QUERY_STORE = ON;
```

**Execution Plan Warnings:**

- Look for yellow exclamation marks in graphical execution plans
- Thick arrows indicate high row counts

For comprehensive SQL Server optimization, see `references/sqlserver.md`.

## Advanced Optimization Techniques

### Common Table Expressions (CTEs)

Break complex queries into readable, maintainable parts:

```sql
WITH active_customers AS (
  SELECT id, name FROM customers WHERE status = 'active'
),
recent_orders AS (
  SELECT customer_id, COUNT(*) AS order_count
  FROM orders
  WHERE created_at > NOW() - INTERVAL '30 days'
  GROUP BY customer_id
)
SELECT ac.name, COALESCE(ro.order_count, 0) AS orders
FROM active_customers ac
LEFT JOIN recent_orders ro ON ac.id = ro.customer_id;
```

### EXISTS vs IN for Subqueries

Use EXISTS for better performance with large datasets:

```sql
-- ✅ Good: EXISTS stops at first match
SELECT * FROM users
WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id);

-- ❌ Less efficient: IN builds full list
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders);
```

### Denormalization Decision Framework

Consider denormalization when:

- Query joins 3+ tables frequently
- Data is relatively static (infrequent updates)
- Read performance is critical
- Write overhead is acceptable

**Denormalization Strategies:**

1. **Duplicate columns**: Copy foreign key data into main table
2. **Summary tables**: Pre-aggregate data
3. **Materialized views**: Database-maintained denormalized views
4. **Application caching**: Redis/Memcached for frequently accessed data

## Optimization Workflow Example

**Scenario:** API endpoint taking 2 seconds to load

**Step 1: Identify Slow Query**

Use APM/observability tools to identify the database query causing the delay.

**Step 2: Run EXPLAIN ANALYZE**

```sql
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC
LIMIT 10;
```

**Step 3: Analyze Output**

```
Seq Scan on orders (cost=0.00..2500.00 rows=10)
  Filter: (customer_id = 123)
  Rows Removed by Filter: 99990
```

**Problem**: Sequential scan filtering out 99,990 rows

**Step 4: Add Composite Index**

```sql
CREATE INDEX idx_orders_customer_created
ON orders (customer_id, created_at DESC);
```

**Step 5: Verify Improvement**

```sql
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123
ORDER BY created_at DESC
LIMIT 10;
```

```
Index Scan using idx_orders_customer_created (cost=0.42..12.44 rows=10)
  Index Cond: (customer_id = 123)
```

**Result**: 200x faster (2000ms → 10ms)

## Monitoring and Maintenance

**Regular Optimization Tasks:**

- Review slow query logs weekly
- Update database statistics regularly (ANALYZE in PostgreSQL, UPDATE STATISTICS in SQL Server)
- Monitor index usage (drop unused indexes)
- Archive old data to keep tables manageable
- Review execution plans for critical queries quarterly

**PostgreSQL Statistics Update:**

```sql
ANALYZE table_name;
```

**MySQL Statistics Update:**

```sql
ANALYZE TABLE table_name;
```

**SQL Server Statistics Update:**

```sql
UPDATE STATISTICS table_name;
```

## Related Skills

- **databases-relational**: Schema design and database fundamentals
- **observability**: Performance monitoring and slow query detection
- **api-patterns**: API-level optimization (pagination, caching)
- **performance-engineering**: Application performance profiling

## Additional Resources

For comprehensive documentation, reference these files:

- `references/explain-guide.md` - Detailed EXPLAIN plan interpretation
- `references/scan-types.md` - Scan type meanings and performance implications
- `references/indexing-decisions.md` - When and how to add indexes
- `references/index-types.md` - Database-specific index types
- `references/composite-indexes.md` - Multi-column index design
- `references/anti-patterns.md` - Common anti-patterns with solutions
- `references/efficient-patterns.md` - Efficient query patterns
- `references/postgresql.md` - PostgreSQL-specific optimizations
- `references/mysql.md` - MySQL-specific optimizations
- `references/sqlserver.md` - SQL Server-specific optimizations

For working SQL examples, see the `examples/` directory.

---

## Referenced Files

> The following files are referenced in this skill and included for context.

### references/explain-guide.md

````markdown
# EXPLAIN Analysis Guide

Comprehensive guide to interpreting execution plans across PostgreSQL, MySQL, and SQL Server.

## Table of Contents

1. [PostgreSQL EXPLAIN](#postgresql-explain)
2. [MySQL EXPLAIN](#mysql-explain)
3. [SQL Server Execution Plans](#sql-server-execution-plans)
4. [Key Metrics to Monitor](#key-metrics-to-monitor)
5. [Common Patterns and Solutions](#common-patterns-and-solutions)

## PostgreSQL EXPLAIN

### Basic EXPLAIN Syntax

**EXPLAIN** - Show query plan without execution:

```sql
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
```

**EXPLAIN ANALYZE** - Execute query and show actual timing:

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';
```

**EXPLAIN (ANALYZE, BUFFERS)** - Include buffer usage:

```sql
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = '[email protected]';
```

### Reading PostgreSQL Output

**Example Output:**

```
Seq Scan on users (cost=0.00..1500.00 rows=1 width=100) (actual time=50.123..50.124 rows=1 loops=1)
  Filter: (email = '[email protected]'::text)
  Rows Removed by Filter: 99999
Planning Time: 0.100 ms
Execution Time: 50.150 ms
```

**Key Components:**

- **Operation**: `Seq Scan` (scan type)
- **Table**: `users` (table being scanned)
- **cost=0.00..1500.00**: Estimated cost range (startup..total)
- **rows=1**: Estimated rows returned
- **width=100**: Average row size in bytes
- **actual time=50.123..50.124**: Actual time range (milliseconds)
- **rows=1**: Actual rows returned
- **loops=1**: Number of times operation executed

**Cost Interpretation:**

- Cost is in arbitrary units (not milliseconds)
- Compare relative costs between plans
- Lower cost = better (usually)

### PostgreSQL Scan Types

**Sequential Scan (Seq Scan):**

- Reads entire table from disk
- No index used
- Acceptable for small tables or full table queries
- **Red flag** for large tables with a WHERE clause

**Index Scan:**

- Direct index traversal
- Excellent for small result sets
- Accesses heap table to retrieve full rows

**Index-Only Scan:**

- All data retrieved from index (no heap access)
- Best performance
- Requires covering index with all needed columns

**Bitmap Heap Scan:**

- Two-step process: identify rows in index → fetch from heap
- Efficient for medium result sets
- Combines multiple index scans

**Nested Loop Join:**

- Iterate outer table, look up inner table per row
- Good when outer table is small
- Requires index on inner table join column

**Hash Join:**

- Build hash table of inner table, probe with outer
- Good for large tables
- Memory-intensive

**Merge Join:**

- Sort both tables, merge sorted results
- Good for pre-sorted data
- Expensive if sorting required

### PostgreSQL Optimization Examples

**Example 1: Sequential Scan → Index Scan**

**Before:**

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';
```

```
Seq Scan on users (cost=0.00..1500.00 rows=1) (actual time=50.123..50.124 rows=1)
  Filter: (email = '[email protected]')
  Rows Removed by Filter: 99999
```

**Optimization:**

```sql
CREATE INDEX idx_users_email ON users (email);
```

**After:**

```sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';
```

```
Index Scan using idx_users_email on users (cost=0.42..8.44 rows=1) (actual time=0.025..0.026 rows=1)
  Index Cond: (email = '[email protected]')
```

**Result:** 1000x faster (50ms → 0.05ms)

## MySQL EXPLAIN

### Basic EXPLAIN Syntax

**Standard EXPLAIN:**

```sql
EXPLAIN SELECT * FROM products WHERE category_id = 5 AND price > 100;
```

**JSON Format** (detailed output):

```sql
EXPLAIN FORMAT=JSON SELECT * FROM products WHERE category_id = 5;
```

### Reading MySQL Output

**Example Output:**

```
+----+-------------+----------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+----------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | products | ALL  | NULL          | NULL | NULL    | NULL | 50000 | Using where |
+----+-------------+----------+------+---------------+------+---------+------+-------+-------------+
```

**Key Columns:**

- **id**: Query identifier
- **select_type**: SIMPLE, PRIMARY, SUBQUERY, DERIVED, UNION
- **table**: Table being accessed
- **type**: Access type (performance indicator)
- **possible_keys**: Indexes MySQL could use
- **key**: Index actually used (NULL = no index)
- **key_len**: Bytes of index used
- **ref**: Columns/constants compared to index
- **rows**: Estimated rows examined
- **Extra**: Additional information

### MySQL Access Types (type column)

**Best to Worst Performance:**

1. **system** - Single-row table (best)
2. **const** - Primary key or unique index lookup
3. **eq_ref** - One row from table for each previous row combination
4. **ref** - Non-unique index lookup
5. **range** - Index range scan (BETWEEN, >, <, IN)
6. **index** - Full index scan
7. **ALL** - Full table scan (worst)

**Target:** Achieve `const`, `eq_ref`, `ref`, or `range` types.

**Red Flag:** `ALL` (full table scan) on large tables.

### MySQL Extra Column Meanings

- **Using index**: Index-only scan (excellent)
- **Using where**: Filtering rows after retrieval
- **Using temporary**: Temporary table created (expensive)
- **Using filesort**: Sorting required (expensive for large result sets)
- **Using join buffer**: Join buffer used (index missing on join column)

### MySQL Optimization Examples

**Example 1: ALL → range**

**Before:**

```sql
EXPLAIN SELECT * FROM products WHERE category_id = 5 AND price > 100;
```

```
type: ALL, possible_keys: NULL, key: NULL, rows: 50000
```

**Optimization:**

```sql
CREATE INDEX idx_products_category_price ON products (category_id, price);
```

**After:**

```sql
EXPLAIN SELECT * FROM products WHERE category_id = 5 AND price > 100;
```

```
type: range, key: idx_products_category_price, rows: 150
```

**Result:** Rows examined reduced from 50,000 to 150.
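After creating an index like this, it can help to confirm that the index is actually registered before trusting the new plan. A minimal sketch (standard MySQL statements; table and index names match the example above):

```sql
-- List the table's indexes; the new composite index should appear
-- with Key_name = 'idx_products_category_price'
SHOW INDEX FROM products;

-- Re-run EXPLAIN and verify the type column changed from ALL to range
EXPLAIN SELECT * FROM products WHERE category_id = 5 AND price > 100;
```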
## SQL Server Execution Plans

### Accessing Execution Plans

**Estimated Execution Plan** (Ctrl+L in SSMS):

```sql
-- Right-click query → Display Estimated Execution Plan
SELECT * FROM Sales.Orders WHERE CustomerID = 123;
```

**Actual Execution Plan** (Ctrl+M, then execute):

```sql
-- Query → Include Actual Execution Plan → Execute
SELECT * FROM Sales.Orders WHERE CustomerID = 123;
```

### Reading SQL Server Execution Plans

**Graphical Execution Plan Components:**

- **Operations**: Boxes representing operations (Scan, Seek, Join)
- **Arrows**: Data flow direction (right to left)
- **Thickness**: Relative row count (thick = many rows)
- **Warnings**: Yellow exclamation marks (missing indexes, implicit conversions)
- **Cost %**: Percentage of total query cost

**Read Direction:** Right to left, top to bottom.

### SQL Server Scan Types

**Table Scan:**

- Reads entire table
- No index available
- Red flag for large tables

**Clustered Index Scan:**

- Reads entire clustered index (full table)
- Similar to table scan

**Index Seek:**

- Direct index lookup
- Excellent performance
- Target for WHERE, JOIN conditions

**Index Scan:**

- Reads entire index
- Better than table scan if index is smaller
- Still inefficient for large indexes

**Key Lookup:**

- Additional lookup to retrieve non-indexed columns
- Indicates covering index opportunity

### SQL Server Optimization Examples

**Example 1: Identify Missing Index**

**Execution Plan Warning:**

```
Missing Index (Impact 95%)
CREATE NONCLUSTERED INDEX [<Name of Missing Index>]
ON [dbo].[Orders] ([CustomerID])
```

**Action:**

```sql
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID);
```

**Example 2: Query Store Analysis**

Find top 10 expensive queries:

```sql
SELECT TOP 10
  q.query_id,
  qt.query_sql_text,
  rs.avg_duration / 1000.0 AS avg_duration_ms,
  rs.avg_logical_io_reads,
  rs.count_executions
FROM sys.query_store_query q
INNER JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
INNER JOIN sys.query_store_plan p ON q.query_id = p.query_id
INNER JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
ORDER BY rs.avg_duration DESC;
```

## Key Metrics to Monitor

### Cross-Database Metrics

| Metric | Good | Warning | Critical |
|--------|------|---------|----------|
| Rows examined vs returned | <10x | 10-100x | >100x |
| Execution time | <10ms | 10-100ms | >100ms |
| Index usage | Present | Partial | None |
| Sort operations | None | Small dataset | Large dataset |

### PostgreSQL-Specific Metrics

- **Buffer hits vs reads**: High hit ratio indicates good cache usage
- **Planning time**: Should typically be <1ms
- **Execution time**: Target <100ms for user-facing queries

### MySQL-Specific Metrics

- **Handler_read_rnd_next**: High value indicates full table scans
- **Created_tmp_tables**: Temporary table creation count
- **Sort_scan**: Number of sorts requiring a table scan

### SQL Server-Specific Metrics

- **Logical reads**: Pages read from cache
- **Physical reads**: Pages read from disk (minimize)
- **CPU time**: CPU milliseconds consumed

## Common Patterns and Solutions

### Pattern 1: High Row Count with Low Results

**Symptom:**

```
Seq Scan on table (cost=0.00..10000.00 rows=100000)
  Filter: (column = value)
  Rows Removed by Filter: 99999
```

**Solution:** Add index on filter column

```sql
CREATE INDEX idx_table_column ON table (column);
```

### Pattern 2: Nested Loop with Large Outer Table

**Symptom:**

```
Nested Loop (cost=0.00..50000.00 rows=10000)
  -> Seq Scan on large_table (rows=10000)
  -> Index Scan on small_table
```

**Solutions:**

1. Add index on large_table filter columns (reduce outer rows)
2. Reorder joins (start with smaller result set)
3. Force a hash join if appropriate (PostgreSQL: `SET enable_nestloop = off`)

### Pattern 3: Sort Operation on Large Result Set

**Symptom:**

```
Sort (cost=5000.00..6000.00 rows=100000)
  Sort Key: created_at DESC
  -> Seq Scan on orders
```

**Solution:** Create index matching ORDER BY

```sql
CREATE INDEX idx_orders_created ON orders (created_at DESC);
```

### Pattern 4: Multiple OR Conditions

**Symptom:**

```sql
SELECT * FROM users
WHERE status = 'active' OR status = 'pending' OR status = 'verified';
```

```
Seq Scan on users
  Filter: ((status = 'active') OR (status = 'pending') OR (status = 'verified'))
```

**Solution:** Use IN or UNION ALL

```sql
-- Better: Use IN
SELECT * FROM users WHERE status IN ('active', 'pending', 'verified');

-- Or: Use UNION ALL with separate indexes
SELECT * FROM users WHERE status = 'active'
UNION ALL
SELECT * FROM users WHERE status = 'pending'
UNION ALL
SELECT * FROM users WHERE status = 'verified';
```

## Quick Reference: EXPLAIN Command Comparison

| Feature | PostgreSQL | MySQL | SQL Server |
|---------|-----------|-------|------------|
| Basic plan | `EXPLAIN` | `EXPLAIN` | Ctrl+L (SSMS) |
| Execute + timing | `EXPLAIN ANALYZE` | `EXPLAIN ANALYZE` (8.0.18+) | Ctrl+M, execute |
| Detailed output | `EXPLAIN (ANALYZE, BUFFERS, VERBOSE)` | `EXPLAIN FORMAT=JSON` | Execution Plan XML |
| Cost shown | Yes (arbitrary units) | Yes (in FORMAT=JSON output) | Yes (percentage) |
| Actual rows | With ANALYZE | With ANALYZE (8.0.18+) | With actual plan |
| Index recommendations | No | No | Yes (warnings) |

## Best Practices

1. **Always run EXPLAIN before optimizing** - Understand the problem before solving it
2. **Compare before/after plans** - Verify optimizations work
3. **Use the ANALYZE variant when possible** - Actual timing beats estimates
4. **Check row estimates vs actuals** - Large discrepancies indicate outdated statistics
5. **Update statistics regularly** - Run ANALYZE/UPDATE STATISTICS weekly
6. **Monitor production queries** - Enable slow query log or Query Store
7. **Archive execution plans** - Track performance changes over time
````

### references/scan-types.md

````markdown
# Scan Types Interpretation

Detailed guide to understanding scan types and join algorithms across PostgreSQL, MySQL, and SQL Server.

## PostgreSQL Scan Types

### Sequential Scan (Seq Scan)

**Description:** Reads entire table sequentially from disk.

**When Used:**

- No suitable index available
- Small tables (optimizer determines full scan is cheaper)
- Query needs majority of table rows

**Performance:**

- Poor for large tables with selective WHERE clause
- Acceptable for small tables (<1000 rows)
- Acceptable when retrieving a large percentage of rows

**Example:**

```sql
EXPLAIN SELECT * FROM users WHERE last_login < '2020-01-01';
```

```
Seq Scan on users (cost=0.00..1500.00 rows=50000)
  Filter: (last_login < '2020-01-01')
```

**Optimization:** Add index on `last_login`

### Index Scan

**Description:** Traverses index to find matching rows, then fetches full rows from heap table.

**When Used:**

- Index available on WHERE/JOIN columns
- Small to medium result sets
- Random access pattern acceptable

**Performance:**

- Excellent for selective queries (<5% of table)
- Better than Seq Scan for most filtered queries

**Example:**

```sql
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
```

```
Index Scan using idx_users_email on users (cost=0.42..8.44 rows=1)
  Index Cond: (email = '[email protected]')
```

**Key Indicator:** `Index Cond` shows the index is being used.

### Index-Only Scan

**Description:** Retrieves all data from index without accessing heap table.
**When Used:** - Covering index contains all needed columns - Index includes WHERE and SELECT columns **Performance:** - Best possible performance - No heap access required - Minimal I/O **Example:** ```sql CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (id, name); EXPLAIN SELECT id, name FROM users WHERE email = '[email protected]'; ``` ``` Index Only Scan using idx_users_email_covering on users (cost=0.42..4.44 rows=1) Index Cond: (email = '[email protected]') ``` **Optimization Goal:** Design covering indexes for frequently-run queries. ### Bitmap Heap Scan **Description:** Two-step process: 1. Bitmap Index Scan: Create bitmap of matching row locations 2. Bitmap Heap Scan: Fetch rows from heap using bitmap **When Used:** - Medium-sized result sets (5-50% of table) - Multiple index conditions combined with OR - PostgreSQL optimizer determines it's more efficient than Index Scan **Performance:** - More efficient than Index Scan for medium result sets - Allows combining multiple indexes **Example:** ```sql EXPLAIN SELECT * FROM orders WHERE status = 'pending' AND priority = 'high'; ``` ``` Bitmap Heap Scan on orders (cost=20.00..500.00 rows=1000) Recheck Cond: ((status = 'pending') AND (priority = 'high')) -> Bitmap Index Scan on idx_orders_status (cost=0.00..10.00 rows=2000) Index Cond: (status = 'pending') -> Bitmap Index Scan on idx_orders_priority (cost=0.00..10.00 rows=1500) Index Cond: (priority = 'high') ``` **Key Indicator:** Multiple bitmap index scans combined. ## PostgreSQL Join Algorithms ### Nested Loop Join **Description:** For each row in outer table, scan inner table for matches. 
**When Used:** - Small outer table - Index on inner table join column - Selective join conditions **Performance:** - Excellent when outer table is small (<100 rows) - Poor when outer table is large **Example:** ```sql EXPLAIN SELECT * FROM small_table INNER JOIN large_table ON small_table.id = large_table.small_id; ``` ``` Nested Loop (cost=0.42..100.00 rows=10) -> Seq Scan on small_table (cost=0.00..1.10 rows=10) -> Index Scan using idx_large_small_id on large_table (cost=0.42..9.89 rows=1) Index Cond: (small_id = small_table.id) ``` **Optimization:** Ensure inner table has index on join column. ### Hash Join **Description:** 1. Build hash table of inner table 2. Probe hash table with outer table rows **When Used:** - Large tables being joined - Equality join conditions (=) - No suitable indexes **Performance:** - Excellent for large tables - Memory-intensive (hash table must fit in work_mem) - Single pass through each table **Example:** ```sql EXPLAIN SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id; ``` ``` Hash Join (cost=500.00..5000.00 rows=10000) Hash Cond: (orders.customer_id = customers.id) -> Seq Scan on orders (cost=0.00..1000.00 rows=10000) -> Hash (cost=250.00..250.00 rows=5000) -> Seq Scan on customers (cost=0.00..250.00 rows=5000) ``` **Tuning:** Increase `work_mem` if hash join spills to disk. ### Merge Join **Description:** 1. Sort both tables by join key 2. 
Merge sorted results **When Used:** - Both tables already sorted by join key - Merge join cheaper than hash join - Equality join conditions **Performance:** - Excellent when data pre-sorted - Expensive if sorting required - Efficient for very large tables (no memory constraints) **Example:** ```sql EXPLAIN SELECT * FROM table1 INNER JOIN table2 ON table1.sorted_col = table2.sorted_col; ``` ``` Merge Join (cost=1000.00..2000.00 rows=10000) Merge Cond: (table1.sorted_col = table2.sorted_col) -> Index Scan using idx_table1_sorted on table1 (cost=0.00..500.00 rows=10000) -> Index Scan using idx_table2_sorted on table2 (cost=0.00..500.00 rows=10000) ``` **Optimization:** Add indexes matching join columns for pre-sorted data. ## MySQL Access Types ### ALL (Table Scan) **Description:** Full table scan, reads every row. **When Used:** - No suitable index - Small tables - Query retrieves most rows **Performance:** - Worst performance for large tables - Acceptable for small tables (<1000 rows) **Example:** ```sql EXPLAIN SELECT * FROM products WHERE description LIKE '%widget%'; ``` ``` type: ALL, rows: 50000 ``` **Optimization:** Add index or use full-text search. ### const **Description:** Single row lookup via primary key or unique index. **When Used:** - WHERE clause on primary key or unique index with constant value - Comparison with constant value **Performance:** - Best possible performance - Single row access **Example:** ```sql EXPLAIN SELECT * FROM users WHERE id = 123; ``` ``` type: const, key: PRIMARY, rows: 1 ``` **Key Indicator:** `type: const`, `rows: 1` ### eq_ref **Description:** One row from table for each combination of rows from previous tables. 
**When Used:** - JOIN on primary key or unique index - All parts of index used **Performance:** - Excellent for joins - One row per outer table row **Example:** ```sql EXPLAIN SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id; ``` ``` type: eq_ref, key: PRIMARY, ref: orders.customer_id, rows: 1 ``` ### ref **Description:** Non-unique index lookup. **When Used:** - Index used but not unique - Multiple rows may match **Performance:** - Good performance - Efficient for moderately selective queries **Example:** ```sql EXPLAIN SELECT * FROM orders WHERE customer_id = 123; ``` ``` type: ref, key: idx_orders_customer, ref: const, rows: 10 ``` ### range **Description:** Index range scan using BETWEEN, <, >, IN, LIKE. **When Used:** - Range conditions on indexed columns - IN with small list - LIKE with prefix (not wildcard at start) **Performance:** - Good performance - Better than full table scan **Example:** ```sql EXPLAIN SELECT * FROM orders WHERE created_at > '2025-01-01'; ``` ``` type: range, key: idx_orders_created, rows: 1000 ``` ### index **Description:** Full index scan (reads entire index). **When Used:** - All needed columns in index (covering index) - Scanning entire index cheaper than table scan **Performance:** - Better than ALL if index smaller than table - Still inefficient for large indexes **Example:** ```sql EXPLAIN SELECT id FROM users; ``` ``` type: index, key: PRIMARY, rows: 100000 ``` ## SQL Server Scan Types ### Table Scan **Description:** Reads entire heap table (no clustered index). **When Used:** - Table has no clustered index - No suitable non-clustered index **Performance:** - Worst performance - Entire table read from disk **Optimization:** Add clustered index or non-clustered index. ### Clustered Index Scan **Description:** Reads entire clustered index (reads all table data). 
**When Used:** - Query needs most/all rows - No filtering or non-sargable WHERE clause **Performance:** - Similar to table scan - Entire table scanned **Example:** ```sql -- Execution plan shows "Clustered Index Scan" SELECT * FROM Orders WHERE YEAR(OrderDate) = 2025; ``` **Optimization:** Make WHERE clause sargable or add non-clustered index. ### Index Seek **Description:** Efficient index traversal to specific rows. **When Used:** - WHERE clause uses indexed columns - Sargable query conditions **Performance:** - Best performance - Minimal I/O **Example:** ```sql -- Execution plan shows "Index Seek" SELECT * FROM Orders WHERE OrderID = 12345; ``` **Key Indicator:** Seek operation in graphical plan. ### Index Scan **Description:** Reads entire index (not entire table). **When Used:** - Covering index contains all needed columns - Non-sargable conditions **Performance:** - Better than table scan if index smaller - Not as efficient as Index Seek **Example:** ```sql -- Execution plan shows "Index Scan" on non-clustered index SELECT CustomerID FROM Orders; ``` ### Key Lookup (RID Lookup / Bookmark Lookup) **Description:** After index seek, additional lookup to retrieve non-indexed columns. **When Used:** - Non-clustered index seek finds rows - Additional columns needed not in index **Performance:** - Additional I/O per row - Indicates covering index opportunity **Example:** ```sql -- Index on CustomerID, but SELECT includes other columns SELECT * FROM Orders WHERE CustomerID = 123; ``` ``` Index Seek (idx_Orders_CustomerID) -> Key Lookup (retrieve remaining columns) ``` **Optimization:** Create covering index with INCLUDE clause: ```sql CREATE NONCLUSTERED INDEX idx_Orders_CustomerID_Covering ON Orders (CustomerID) INCLUDE (OrderDate, TotalAmount); ``` ## Performance Hierarchy ### PostgreSQL (Best to Worst) 1. Index-Only Scan 2. Index Scan 3. Bitmap Heap Scan 4. Sequential Scan ### MySQL (Best to Worst) 1. system 2. const 3. eq_ref 4. ref 5. range 6. index 7. 
ALL

### SQL Server (Best to Worst)

1. Index Seek (non-clustered)
2. Clustered Index Seek
3. Index Scan (covering index)
4. Clustered Index Scan
5. Table Scan

## Decision Matrix: When to Add Index

| Scan Type | Rows Examined | Rows Returned | Action |
|-----------|---------------|---------------|--------|
| Seq Scan / ALL | >10,000 | <1% | ADD INDEX |
| Seq Scan / ALL | >10,000 | 1-10% | ADD INDEX |
| Seq Scan / ALL | >10,000 | >50% | OK (full scan appropriate) |
| Index Scan | Any | <5% | GOOD |
| Bitmap Heap Scan | Any | 5-50% | GOOD (PostgreSQL optimization) |
| Hash Join | Large tables | Any | Consider indexes on join columns |

## Common Optimization Patterns

### Pattern 1: Seq Scan → Index Scan

**Symptom:**
```
Seq Scan on table (cost=0.00..10000.00 rows=1)
  Filter: (column = value)
  Rows Removed by Filter: 999999
```

**Solution:**
```sql
CREATE INDEX idx_table_column ON table (column);
```

### Pattern 2: Index Scan → Index-Only Scan

**Symptom:**
```
Index Scan using idx_table_column on table (cost=0.42..100.00 rows=10)
  Index Cond: (column = value)
```

**Solution:** Create a covering index:
```sql
CREATE INDEX idx_table_column_covering ON table (column) INCLUDE (other_column1, other_column2);
```

### Pattern 3: Nested Loop → Hash Join

**Symptom:**
```
Nested Loop (cost=0.00..100000.00 rows=10000)
  -> Seq Scan on large_outer (rows=10000)
  -> Index Scan on large_inner
```

**Solution:** Add an index on the outer table's filter columns to reduce rows, or let the optimizer choose a hash join.
### Pattern 4: Multiple Index Scans → Composite Index **Symptom:** ``` Bitmap Heap Scan on table -> Bitmap Index Scan on idx_col1 -> Bitmap Index Scan on idx_col2 ``` **Solution:** Create composite index ```sql CREATE INDEX idx_table_col1_col2 ON table (col1, col2); ``` ## Quick Reference ### Scan Type Performance | Database | Best | Good | Acceptable | Poor | |----------|------|------|------------|------| | PostgreSQL | Index-Only Scan | Index Scan, Bitmap Heap Scan | - | Seq Scan | | MySQL | const, eq_ref | ref, range | index | ALL | | SQL Server | Index Seek | Index Scan (covering) | Clustered Index Scan | Table Scan | ### When Each Scan Type is OK - **Sequential/Full Table Scan**: Small tables (<1000 rows), or query needs >50% of rows - **Index Scan**: Selective queries (<10% of rows), index available - **Index-Only Scan**: Always preferred when possible (covering index) - **Bitmap Heap Scan**: Medium result sets (PostgreSQL optimization) - **Nested Loop**: Small outer table with index on inner join column - **Hash Join**: Large tables, equality joins, no suitable indexes ``` ### references/indexing-decisions.md ```markdown # Indexing Decisions Guide Comprehensive framework for deciding when and how to add indexes to optimize SQL query performance. ## Table of Contents 1. [Index Decision Framework](#index-decision-framework) 2. [Index Selection Criteria](#index-selection-criteria) 3. [Single-Column vs Composite Indexes](#single-column-vs-composite-indexes) 4. [Covering Indexes](#covering-indexes) 5. [When NOT to Add Indexes](#when-not-to-add-indexes) 6. [Index Maintenance](#index-maintenance) ## Index Decision Framework ### Primary Decision Tree ``` Is column used in WHERE, JOIN, ORDER BY, or GROUP BY? ├─ YES → Is column selective (many unique values)? │ ├─ YES → Is table frequently queried? │ │ ├─ YES → Is table write-heavy? 
│ │ │ ├─ YES → Balance read vs write performance │ │ │ └─ NO → ADD INDEX ✅ │ │ └─ NO → Consider based on query frequency │ └─ NO (low selectivity) → Skip index ❌ │ Exception: Partial index for specific subset └─ NO → Skip index ❌ ``` ### Selectivity Assessment **High Selectivity (Good for Indexes):** - Primary keys (100% unique) - Email addresses (usually unique) - UUIDs (unique) - User IDs (unique per user) - Timestamps (high variety) **Low Selectivity (Poor for Indexes):** - Boolean fields (true/false) - Status fields with few values (active/inactive) - Gender fields (limited values) - Small enum fields (<10 values) **Selectivity Formula:** ```sql -- PostgreSQL SELECT COUNT(DISTINCT column_name)::float / COUNT(*)::float AS selectivity FROM table_name; ``` **Guideline:** Selectivity > 0.1 (10% unique) = candidate for index ### Query Frequency Assessment **High Frequency (Prioritize Indexing):** - User-facing queries (every page load) - API endpoints hit frequently - Real-time dashboards - Background jobs running every minute **Medium Frequency (Evaluate Trade-offs):** - Admin dashboards - Reporting queries (daily/weekly) - Batch processes (hourly) **Low Frequency (Deprioritize Indexing):** - Ad-hoc queries - One-time data migrations - Infrequent reports (monthly/quarterly) ## Index Selection Criteria ### Criteria 1: Column Usage Patterns **WHERE Clause:** ```sql SELECT * FROM orders WHERE customer_id = 123; ``` **Decision:** Index `customer_id` (equality filter) **JOIN Conditions:** ```sql SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id; ``` **Decision:** Index `orders.customer_id` and `customers.id` (foreign key + primary key) **ORDER BY:** ```sql SELECT * FROM orders ORDER BY created_at DESC LIMIT 10; ``` **Decision:** Index `created_at DESC` (sort optimization) **GROUP BY:** ```sql SELECT status, COUNT(*) FROM orders GROUP BY status; ``` **Decision:** Index `status` (grouping optimization) ### Criteria 2: Table Size | Table Size | 
Index Threshold | Reasoning | |-----------|----------------|-----------| | <1,000 rows | Skip indexes | Query planner may prefer seq scan | | 1,000-10,000 rows | Selective indexes | Index beneficial for selective queries | | 10,000-1M rows | Most queries | Indexes critical for performance | | >1M rows | All frequent queries | Index essential, consider partitioning | ### Criteria 3: Write vs Read Ratio **Read-Heavy Tables (>90% reads):** - Aggressive indexing strategy - Multiple indexes acceptable - Covering indexes beneficial **Balanced Tables (50-90% reads):** - Moderate indexing - Focus on most frequent queries - Limit to 3-5 indexes per table **Write-Heavy Tables (>50% writes):** - Minimal indexing - Only index critical queries - Consider batching writes **Write Performance Impact:** ``` Each additional index: - INSERT: +5-10% overhead per index - UPDATE: +5-10% overhead per affected index - DELETE: +5-10% overhead per index ``` ### Criteria 4: Data Type Considerations **Good for Indexing:** - Integer types (INT, BIGINT) - UUID/GUID - Timestamps (DATE, TIMESTAMP) - Short strings (VARCHAR(100)) **Acceptable for Indexing:** - Medium strings (VARCHAR(255)) - DECIMAL/NUMERIC **Poor for Indexing:** - Large TEXT/BLOB columns - Very long strings (>1000 chars) - JSON (use specialized indexes: GIN for PostgreSQL, generated columns for MySQL) **Exception:** Full-text indexes for TEXT columns ## Single-Column vs Composite Indexes ### When to Use Single-Column Index **Use Case 1: Single Filter** ```sql SELECT * FROM users WHERE email = '[email protected]'; ``` **Index:** ```sql CREATE INDEX idx_users_email ON users (email); ``` **Use Case 2: Simple JOIN** ```sql SELECT * FROM orders INNER JOIN customers ON orders.customer_id = customers.id; ``` **Index:** ```sql CREATE INDEX idx_orders_customer ON orders (customer_id); ``` ### When to Use Composite Index **Use Case 1: Multiple Equality Filters** ```sql SELECT * FROM orders WHERE customer_id = 123 AND status = 'shipped'; ``` 
**Index:** ```sql CREATE INDEX idx_orders_customer_status ON orders (customer_id, status); ``` **Use Case 2: Filter + Sort** ```sql SELECT * FROM orders WHERE customer_id = 123 ORDER BY created_at DESC; ``` **Index:** ```sql CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at DESC); ``` **Use Case 3: Filter + Group** ```sql SELECT customer_id, status, COUNT(*) FROM orders WHERE created_at > '2025-01-01' GROUP BY customer_id, status; ``` **Index:** ```sql CREATE INDEX idx_orders_created_customer_status ON orders (created_at, customer_id, status); ``` ### Composite Index Column Order **Rule of Thumb:** 1. **Equality filters** (most selective first) 2. **Range filters** (if any) 3. **ORDER BY columns** (matching sort direction) **Example:** ```sql SELECT * FROM orders WHERE customer_id = 123 -- Equality (put first) AND status IN ('shipped', 'pending') -- Equality (put second) AND total > 100 -- Range (put third) ORDER BY created_at DESC; -- Sort (put last) ``` **Optimal Index:** ```sql CREATE INDEX idx_orders_customer_status_total_created ON orders (customer_id, status, total, created_at DESC); ``` **Left-Prefix Rule:** Composite index on (A, B, C) can be used for: - WHERE A = ? - WHERE A = ? AND B = ? - WHERE A = ? AND B = ? AND C = ? But NOT for: - WHERE B = ? (skips leading column) - WHERE C = ? (skips leading columns) ## Covering Indexes ### What is a Covering Index? Index that contains ALL columns needed by query (no heap table access required). 
**Benefits:** - Fastest possible performance (Index-Only Scan) - Reduced I/O (no heap access) - Better cache utilization ### Creating Covering Indexes **PostgreSQL INCLUDE Clause:** ```sql CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (id, name, created_at); ``` **MySQL Composite Index:** ```sql -- MySQL doesn't have INCLUDE, add columns to index CREATE INDEX idx_users_email_id_name ON users (email, id, name); ``` **SQL Server INCLUDE Clause:** ```sql CREATE NONCLUSTERED INDEX IX_Users_Email_Covering ON Users (Email) INCLUDE (ID, Name, CreatedAt); ``` ### When to Use Covering Indexes **Use Case 1: Frequent Query with Specific Columns** ```sql -- Query runs 1000x per second SELECT id, name FROM users WHERE email = '[email protected]'; ``` **Covering Index:** ```sql CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (id, name); ``` **Use Case 2: API Endpoint Returning Specific Fields** ```sql -- API: GET /api/orders?customer_id=123 (returns id, total, status) SELECT id, total, status FROM orders WHERE customer_id = 123; ``` **Covering Index:** ```sql CREATE INDEX idx_orders_customer_covering ON orders (customer_id) INCLUDE (id, total, status); ``` ### Covering Index Trade-offs **Benefits:** - Dramatic performance improvement (Index-Only Scan) - No heap access = less I/O **Costs:** - Larger index size - Slower writes (more data to update) - More storage required **Guideline:** Use covering indexes for critical queries (user-facing, high-frequency). ## When NOT to Add Indexes ### Anti-Pattern 1: Indexing Low-Selectivity Columns **Bad:** ```sql CREATE INDEX idx_users_is_active ON users (is_active); -- ❌ Boolean ``` **Why:** Only 2 values (true/false), index scan often worse than seq scan. 
**Exception:** Partial index for minority case ```sql -- If 99% inactive, 1% active CREATE INDEX idx_users_active ON users (id) WHERE is_active = true; ``` ### Anti-Pattern 2: Indexing Small Tables **Bad:** ```sql CREATE INDEX idx_config_key ON config (key); -- ❌ Table has 50 rows ``` **Why:** Small tables fit in memory, seq scan faster than index overhead. **Guideline:** Skip indexes for tables <1000 rows (unless foreign keys). ### Anti-Pattern 3: Over-Indexing **Bad:** ```sql CREATE INDEX idx_users_email ON users (email); CREATE INDEX idx_users_email_name ON users (email, name); -- ❌ Redundant CREATE INDEX idx_users_email_name_id ON users (email, name, id); -- ❌ Redundant ``` **Why:** Multiple overlapping indexes waste space and slow writes. **Fix:** Use single covering index ```sql CREATE INDEX idx_users_email_covering ON users (email) INCLUDE (name, id); ``` ### Anti-Pattern 4: Indexing Write-Heavy Columns **Bad:** ```sql CREATE INDEX idx_pageviews_timestamp ON pageviews (timestamp); -- ❌ High-insert table ``` **Why:** Each insert updates index, slowing down write-heavy operations. **Alternative:** Partition table by time range, index within partitions. 
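The write overhead behind Anti-Pattern 4 is easy to observe directly. A rough sketch using SQLite as a stand-in — table and column names are illustrative, and absolute timings vary by machine, so treat this as a demonstration rather than a benchmark:

```python
import sqlite3
import time

# Anti-Pattern 4 illustrated: every secondary index must be updated on INSERT,
# so write-heavy tables pay per-index overhead. SQLite stand-in; names are
# illustrative and timings vary by machine.
def insert_time(index_columns):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pageviews (id INTEGER PRIMARY KEY, ts INTEGER, url TEXT, uid INTEGER)"
    )
    for col in index_columns:
        conn.execute(f"CREATE INDEX idx_pageviews_{col} ON pageviews ({col})")
    rows = [(i, f"/page/{i % 50}", i % 1000) for i in range(20_000)]
    start = time.perf_counter()
    with conn:  # single transaction, so only index maintenance differs
        conn.executemany("INSERT INTO pageviews (ts, url, uid) VALUES (?, ?, ?)", rows)
    return time.perf_counter() - start

t0 = insert_time([])                    # heap only
t3 = insert_time(["ts", "url", "uid"])  # three secondary indexes to maintain
print(f"0 indexes: {t0:.4f}s  3 indexes: {t3:.4f}s")
```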
### Anti-Pattern 5: Indexing Calculated Columns (Without Expression Index)

**Bad:**
```sql
-- Query uses function
SELECT * FROM users WHERE LOWER(email) = '[email protected]';

-- Regular index won't help
CREATE INDEX idx_users_email ON users (email); -- ❌ Not used
```

**Fix:** Expression index (PostgreSQL, SQL Server)
```sql
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
```

## Index Maintenance

### Monitoring Index Usage

**PostgreSQL:**
```sql
-- Find unused indexes
SELECT schemaname, relname, indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE 'pg_toast_%'
ORDER BY idx_scan;
```

**MySQL:**
```sql
-- Requires the sys schema (MySQL 5.7+)
SELECT * FROM sys.schema_unused_indexes;
```

**SQL Server:**
```sql
-- Find unused indexes (an index with no usage-stats row has never been used,
-- hence the COALESCE on the LEFT JOIN columns)
SELECT OBJECT_NAME(i.object_id) AS TableName, i.name AS IndexName,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats s
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE COALESCE(s.user_seeks, 0) = 0
  AND COALESCE(s.user_scans, 0) = 0
  AND COALESCE(s.user_lookups, 0) = 0
  AND i.type_desc = 'NONCLUSTERED';
```

### Removing Unused Indexes

**Process:**
1. Identify unused indexes (0 scans over 7+ days)
2. Verify index not used for constraints
3. Drop index during low-traffic period
4. Monitor for performance regressions

**Drop Command:**
```sql
-- MySQL / SQL Server
DROP INDEX idx_unused_index ON table_name;

-- PostgreSQL
DROP INDEX idx_unused_index;
```

### Updating Index Statistics

**PostgreSQL:**
```sql
-- Analyze entire database
ANALYZE;

-- Analyze specific table
ANALYZE table_name;
```

**MySQL:**
```sql
-- Analyze table
ANALYZE TABLE table_name;
```

**SQL Server:**
```sql
-- Update statistics
UPDATE STATISTICS table_name;

-- Update statistics for specific index
UPDATE STATISTICS table_name index_name;
```

**Schedule:** Weekly for active tables, monthly for stable tables.
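What "updating statistics" actually produces can be sketched with SQLite's `ANALYZE`, standing in for PostgreSQL's `ANALYZE`, MySQL's `ANALYZE TABLE`, and SQL Server's `UPDATE STATISTICS`; names and data below are illustrative:

```python
import sqlite3

# Planner-statistics sketch: SQLite's ANALYZE writes per-index row estimates
# into sqlite_stat1, which the planner consults when costing plans.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i % 10,) for i in range(1000)])
conn.execute("CREATE INDEX idx_t_a ON t (a)")
conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
# stat holds "<total rows> <avg rows per distinct key>"
print(stats)
```

Stale or missing estimates of this kind are exactly what leads planners to pick sequential scans over available indexes, which is why statistics refreshes belong on a schedule.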
### Rebuilding Fragmented Indexes **PostgreSQL:** ```sql -- Rebuild index (locks table) REINDEX INDEX idx_name; -- Rebuild concurrently (no lock) REINDEX INDEX CONCURRENTLY idx_name; ``` **MySQL:** ```sql -- Optimize table (rebuilds indexes) OPTIMIZE TABLE table_name; ``` **SQL Server:** ```sql -- Rebuild index ALTER INDEX index_name ON table_name REBUILD; -- Reorganize index (online, less intrusive) ALTER INDEX index_name ON table_name REORGANIZE; ``` **When to Rebuild:** - Fragmentation >30% - After large batch deletes - Query performance degrades over time ## Index Design Patterns ### Pattern 1: Foreign Key Indexing **Rule:** Always index foreign keys. **Example:** ```sql CREATE TABLE orders ( id INT PRIMARY KEY, customer_id INT, FOREIGN KEY (customer_id) REFERENCES customers(id) ); -- Always add this index CREATE INDEX idx_orders_customer ON orders (customer_id); ``` **Why:** Enables efficient joins and cascading deletes. ### Pattern 2: Status + Timestamp Indexing **Use Case:** Queries filtering by status and ordering by time. ```sql SELECT * FROM orders WHERE status = 'pending' ORDER BY created_at DESC LIMIT 10; ``` **Index:** ```sql CREATE INDEX idx_orders_status_created ON orders (status, created_at DESC); ``` ### Pattern 3: Multi-Tenant Indexing **Use Case:** SaaS application with tenant_id in most queries. ```sql SELECT * FROM documents WHERE tenant_id = 123 AND status = 'active' ORDER BY updated_at DESC; ``` **Index:** ```sql CREATE INDEX idx_documents_tenant_status_updated ON documents (tenant_id, status, updated_at DESC); ``` **Rule:** Always lead with tenant_id in multi-tenant applications. ### Pattern 4: Partial Index for Subset **Use Case:** Query only active records (minority of data). 
```sql SELECT * FROM users WHERE status = 'active'; ``` **Full Index (Inefficient):** ```sql CREATE INDEX idx_users_status ON users (status); -- Large index ``` **Partial Index (Efficient - PostgreSQL):** ```sql CREATE INDEX idx_users_active ON users (id) WHERE status = 'active'; -- Smaller, faster ``` ### Pattern 5: Expression Index for Case-Insensitive Search **Use Case:** Case-insensitive email lookup. ```sql SELECT * FROM users WHERE LOWER(email) = '[email protected]'; ``` **Expression Index (PostgreSQL, SQL Server):** ```sql CREATE INDEX idx_users_email_lower ON users (LOWER(email)); ``` **MySQL Alternative (Generated Column):** ```sql ALTER TABLE users ADD COLUMN email_lower VARCHAR(255) GENERATED ALWAYS AS (LOWER(email)) STORED; CREATE INDEX idx_users_email_lower ON users (email_lower); ``` ## Quick Reference ### Index Decision Checklist - [ ] Column used in WHERE, JOIN, ORDER BY, or GROUP BY? - [ ] High selectivity (>10% unique values)? - [ ] Table size >1,000 rows? - [ ] Query frequency high (user-facing or frequent)? - [ ] Write-to-read ratio acceptable (<50% writes)? - [ ] No overlapping composite indexes already exist? - [ ] Index size acceptable (<20% of table size)? 
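The selectivity item in the checklist above can be computed mechanically with the `COUNT(DISTINCT ...) / COUNT(*)` formula shown earlier. A sketch using SQLite with illustrative data:

```python
import sqlite3

# Selectivity = distinct values / total rows. SQLite stand-in; the users
# table and its data are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, is_active INTEGER)")
conn.executemany(
    "INSERT INTO users (email, is_active) VALUES (?, ?)",
    [(f"user{i}@example.com", i % 2) for i in range(1000)],
)

def selectivity(column):
    n_distinct, n_rows = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM users"
    ).fetchone()
    return n_distinct / n_rows

print(selectivity("email"))      # 1.0   -> strong index candidate
print(selectivity("is_active"))  # 0.002 -> skip (boolean, 2 distinct values)
```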
### Index Type Selection

| Use Case | PostgreSQL | MySQL | SQL Server |
|----------|-----------|-------|------------|
| General queries | B-tree | B-tree | Non-clustered |
| Equality only | Hash | B-tree | Non-clustered |
| Full-text search | GIN | Full-text | Full-text |
| Spatial data | GiST | Spatial | Spatial |
| JSONB / JSON | GIN | Generated column + index | Computed column + index |
| Large sequential table | BRIN | B-tree with partitions | Clustered columnstore |

### Index Maintenance Schedule

| Task | Frequency | Purpose |
|------|-----------|---------|
| Update statistics | Weekly | Optimize query plans |
| Review unused indexes | Monthly | Remove waste |
| Rebuild fragmented indexes | Quarterly | Fix fragmentation |
| Analyze query performance | Weekly | Identify missing indexes |
```

### references/index-types.md

```markdown
# Index Types by Database

Quick reference guide to index types across PostgreSQL, MySQL, and SQL Server.

## PostgreSQL Index Types

| Index Type | Use Case | Operators Supported | Creation |
|-----------|----------|-------------------|----------|
| **B-tree** | General-purpose (default) | <, ≤, =, ≥, >, BETWEEN, IN, IS NULL | `CREATE INDEX ON table (column)` |
| **Hash** | Equality only | = | `CREATE INDEX ON table USING HASH (column)` |
| **GIN** | Full-text, JSONB, arrays | @>, ?, ?&, ?\|, @@ | `CREATE INDEX ON table USING GIN (column)` |
| **GiST** | Spatial, ranges, full-text | &&, <->, @>, <<, &< | `CREATE INDEX ON table USING GIST (column)` |
| **SP-GiST** | Non-balanced trees, partitioned | Depends on operator class | `CREATE INDEX ON table USING SPGIST (column)` |
| **BRIN** | Large sequential tables | <, ≤, =, ≥, > | `CREATE INDEX ON table USING BRIN (column)` |
| **Bloom** | Multi-column equality (bloom extension) | = (multiple columns) | `CREATE INDEX ON table USING BLOOM (col1, col2, ...)` |

**Recommendations:**
- **Default:** B-tree (99% of use cases)
- **Full-text:** GIN on `tsvector`
- **JSONB:** GIN
- **Spatial:** GiST
- **Time-series >100GB:** BRIN

## MySQL Index Types

| Index Type | Use Case | Storage Engines | Creation |
|-----------|----------|----------------|----------|
| **B-tree** | General-purpose (default) | InnoDB, MyISAM | `CREATE INDEX ON table (column)` |
| **Hash** | Equality only | MEMORY engine only | `CREATE INDEX ON table (column) USING HASH` |
| **Full-text** | Text search | InnoDB (5.6+), MyISAM | `CREATE FULLTEXT INDEX ON table (column)` |
| **Spatial** | Geometric data | InnoDB (5.7+), MyISAM | `CREATE SPATIAL INDEX ON table (column)` |

**Recommendations:**
- **Default:** B-tree
- **Text search:** Full-text index
- **Spatial:** Spatial index

## SQL Server Index Types

| Index Type | Use Case | Clustered | Creation |
|-----------|----------|-----------|----------|
| **Clustered** | Primary table organization | Yes (1 per table) | `CREATE CLUSTERED INDEX ON table (column)` |
| **Non-Clustered** | Secondary lookups | No (multiple per table) | `CREATE NONCLUSTERED INDEX ON table (column)` |
| **Covering (INCLUDE)** | Index-only scans | No | `CREATE INDEX ON table (col) INCLUDE (col2, ...)` |
| **Filtered** | Partial index | No | `CREATE INDEX ON table (col) WHERE condition` |
| **Columnstore** | Analytics/DW | Yes or No | `CREATE COLUMNSTORE INDEX ON table` |
| **Full-text** | Text search | No | `CREATE FULLTEXT INDEX ON table (column)` |
| **Spatial** | Geometric data | No | `CREATE SPATIAL INDEX ON table (column)` |
| **XML** | XML data | No | `CREATE XML INDEX ON table (column)` |

**Recommendations:**
- **Primary key:** Clustered index (default)
- **Foreign keys:** Non-clustered index
- **Frequent queries:** Covering index
- **Analytics:** Columnstore index

## Cross-Database Index Comparison

### General-Purpose Indexes

| Database | Name | Notes |
|----------|------|-------|
| PostgreSQL | B-tree | Default, most common |
| MySQL | B-tree | InnoDB uses clustered primary key |
| SQL Server | Non-Clustered | Separate from table data |

### Full-Text Search

| Database | Implementation | Query Syntax |
|----------|---------------|--------------| | PostgreSQL | GIN + tsvector | `to_tsvector() @@ to_tsquery()` | | MySQL | Full-text index | `MATCH() AGAINST()` | | SQL Server | Full-text index | `CONTAINS()`, `FREETEXT()` | ### Partial/Filtered Indexes | Database | Support | Syntax | |----------|---------|--------| | PostgreSQL | Yes (Partial Index) | `CREATE INDEX ... WHERE condition` | | MySQL | No | Use generated columns as workaround | | SQL Server | Yes (Filtered Index) | `CREATE INDEX ... WHERE condition` | ### Expression/Computed Indexes | Database | Support | Syntax | |----------|---------|--------| | PostgreSQL | Yes (Expression Index) | `CREATE INDEX ON table (LOWER(column))` | | MySQL | Via Generated Columns | `ADD COLUMN ... GENERATED ... + INDEX` | | SQL Server | Via Computed Columns | `ADD COLUMN ... AS expression + INDEX` | ### Covering Indexes | Database | Implementation | Syntax | |----------|---------------|--------| | PostgreSQL | INCLUDE clause | `CREATE INDEX ON t (col) INCLUDE (col2, col3)` | | MySQL | Add columns to index | `CREATE INDEX ON t (col, col2, col3)` | | SQL Server | INCLUDE clause | `CREATE INDEX ON t (col) INCLUDE (col2, col3)` | ## Index Selection Decision Tree ``` What type of query? 
├─ Equality (column = value) │ ├─ PostgreSQL → B-tree or Hash │ ├─ MySQL → B-tree │ └─ SQL Server → Non-clustered │ ├─ Range (column > value, BETWEEN) │ ├─ PostgreSQL → B-tree │ ├─ MySQL → B-tree │ └─ SQL Server → Non-clustered │ ├─ Full-text search │ ├─ PostgreSQL → GIN (tsvector) │ ├─ MySQL → Full-text │ └─ SQL Server → Full-text │ ├─ JSON queries │ ├─ PostgreSQL → GIN (JSONB) │ ├─ MySQL → Generated column + B-tree │ └─ SQL Server → JSON index (2016+) │ ├─ Spatial queries │ ├─ PostgreSQL → GiST (PostGIS) │ ├─ MySQL → Spatial │ └─ SQL Server → Spatial │ └─ Large time-series table ├─ PostgreSQL → BRIN ├─ MySQL → Partitioning + B-tree └─ SQL Server → Partitioning + Columnstore ``` ``` ### references/composite-indexes.md ```markdown # Composite Index Design Guide to designing multi-column composite indexes for optimal query performance. ## Column Order Rules ### Rule 1: Equality Filters First (Most Selective) **Query Pattern:** ```sql SELECT * FROM orders WHERE customer_id = 123 AND status = 'shipped'; ``` **Optimal Index:** ```sql CREATE INDEX idx_orders_customer_status ON orders (customer_id, status); ``` **Why:** Most selective filters first reduce index tree traversal. ### Rule 2: Range Filters After Equality **Query Pattern:** ```sql SELECT * FROM orders WHERE customer_id = 123 AND created_at > '2025-01-01'; ``` **Optimal Index:** ```sql CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at); ``` **Why:** Equality narrows down to specific branch, then range scan within. ### Rule 3: ORDER BY Columns Last **Query Pattern:** ```sql SELECT * FROM orders WHERE customer_id = 123 ORDER BY created_at DESC LIMIT 10; ``` **Optimal Index:** ```sql CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at DESC); ``` **Why:** Pre-sorted results, no separate sort operation needed. ## Left-Prefix Rule **Composite Index:** `(A, B, C)` **Can Be Used For:** - `WHERE A = ?` - `WHERE A = ? AND B = ?` - `WHERE A = ? AND B = ? 
AND C = ?` - `WHERE A = ? ORDER BY B` **Cannot Be Used For:** - `WHERE B = ?` (skips leading column A) - `WHERE C = ?` (skips leading columns A, B) - `WHERE B = ? AND C = ?` (skips leading column A) **Example:** ```sql -- Index: (customer_id, status, created_at) -- ✅ Uses index (customer_id) SELECT * FROM orders WHERE customer_id = 123; -- ✅ Uses index (customer_id, status) SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending'; -- ✅ Uses full index SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending' AND created_at > '2025-01-01'; -- ❌ Cannot use index (skips customer_id) SELECT * FROM orders WHERE status = 'pending'; ``` ## Common Patterns ### Pattern 1: Multi-Tenant Application **Query:** ```sql SELECT * FROM documents WHERE tenant_id = 123 AND status = 'active' ORDER BY updated_at DESC; ``` **Index:** ```sql CREATE INDEX idx_documents_tenant_status_updated ON documents (tenant_id, status, updated_at DESC); ``` **Why:** tenant_id first (always filtered), status second, updated_at for sorting. ### Pattern 2: Status + Time Range **Query:** ```sql SELECT * FROM tasks WHERE status = 'pending' AND due_date < NOW(); ``` **Index:** ```sql CREATE INDEX idx_tasks_status_due ON tasks (status, due_date); ``` ### Pattern 3: Multiple Equality + Range **Query:** ```sql SELECT * FROM products WHERE category_id = 5 AND brand_id = 10 AND price > 100; ``` **Index:** ```sql CREATE INDEX idx_products_category_brand_price ON products (category_id, brand_id, price); ``` **Column Order:** Most selective first, range last. 
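The left-prefix rule behind these patterns can be verified directly. A sketch using SQLite as a stand-in engine (table, index, and data are illustrative): a composite index on `(customer_id, status)` serves a filter on the leading column, but not one that skips it.

```python
import sqlite3

# Left-prefix rule check: the composite index serves the leading-column
# filter; a status-only filter cannot use it and falls back to a full scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
             "status TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer_id, status, total) VALUES (?, ?, ?)",
    [(i % 100, ("pending", "shipped")[i % 2], float(i)) for i in range(1000)],
)
conn.execute("CREATE INDEX idx_orders_cust_status ON orders (customer_id, status)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable step in the last column
    return " ".join(r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + sql))

leading = plan("SELECT * FROM orders WHERE customer_id = 7")      # uses the index
skipped = plan("SELECT * FROM orders WHERE status = 'pending'")   # full scan
print(leading)
print(skipped)
```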
### Pattern 4: JOIN + Filter **Query:** ```sql SELECT * FROM order_items INNER JOIN orders ON order_items.order_id = orders.id WHERE orders.customer_id = 123; ``` **Indexes:** ```sql -- Index on order_items for JOIN CREATE INDEX idx_order_items_order ON order_items (order_id); -- Composite index on orders CREATE INDEX idx_orders_customer ON orders (customer_id); ``` ## Selectivity Ordering ### High Selectivity → Low Selectivity **Table:** 1,000,000 orders - `customer_id`: 10,000 unique values (high selectivity) - `status`: 5 unique values (low selectivity) **Query:** ```sql SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending'; ``` **Optimal:** ```sql CREATE INDEX idx_orders_customer_status ON orders (customer_id, status); ``` **Why:** - `customer_id = 123` narrows to ~100 rows - `status = 'pending'` filters within those 100 rows - Much better than filtering 200,000 pending orders for customer_id **Suboptimal:** ```sql CREATE INDEX idx_orders_status_customer ON orders (status, customer_id); ``` - `status = 'pending'` starts with 200,000 rows - Then filters for customer_id - Larger initial scan ## Index vs Query Mismatch ### Mismatch Example **Index:** ```sql CREATE INDEX idx_orders_customer_status ON orders (customer_id, status); ``` **Query:** ```sql -- ❌ Query uses status first (skips customer_id) SELECT * FROM orders WHERE status = 'pending'; ``` **Solution:** Create separate index for status-only queries: ```sql CREATE INDEX idx_orders_status ON orders (status); ``` ### Partial Index Solution (PostgreSQL) Instead of full index on low-selectivity column: ```sql -- ✅ Partial index for specific status CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending'; ``` ## Including Non-Indexed Columns ### PostgreSQL INCLUDE Clause **Query:** ```sql SELECT customer_id, status, total, created_at FROM orders WHERE customer_id = 123; ``` **Covering Index:** ```sql CREATE INDEX idx_orders_customer_covering ON orders (customer_id) INCLUDE 
(status, total, created_at); ``` **Benefit:** Index-Only Scan (no heap access). ### MySQL Approach **Composite Index with Extra Columns:** ```sql CREATE INDEX idx_orders_customer_covering ON orders (customer_id, status, total, created_at); ``` **Trade-off:** Larger index, but enables covering scans. ### SQL Server INCLUDE Clause **Covering Index:** ```sql CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON Orders (CustomerID) INCLUDE (Status, Total, CreatedAt); ``` ## Multi-Column vs Multiple Indexes ### Single Composite Index **Index:** ```sql CREATE INDEX idx_orders_customer_status ON orders (customer_id, status); ``` **Queries Supported:** - `WHERE customer_id = ?` - `WHERE customer_id = ? AND status = ?` **Queries NOT Supported:** - `WHERE status = ?` (skips leading column) ### Multiple Single-Column Indexes **Indexes:** ```sql CREATE INDEX idx_orders_customer ON orders (customer_id); CREATE INDEX idx_orders_status ON orders (status); ``` **Queries Supported:** - `WHERE customer_id = ?` (uses idx_orders_customer) - `WHERE status = ?` (uses idx_orders_status) - `WHERE customer_id = ? 
AND status = ?` (bitmap scan in PostgreSQL, index merge in MySQL) **Trade-off:** - More indexes = slower writes - More flexible for varied query patterns ### Recommendation **High-frequency query patterns:** Composite index **Varied query patterns:** Multiple indexes or partial indexes ## Testing Composite Indexes ### Before Creating Index ```sql EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC; ``` **Look for:** - Sequential Scan / Table Scan - High row counts - Sort operation ### After Creating Index ```sql CREATE INDEX idx_orders_customer_status_created ON orders (customer_id, status, created_at DESC); EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC; ``` **Look for:** - Index Scan / Index Seek - Low row counts - No sort operation (pre-sorted by index) ## Quick Reference ### Column Order Priority 1. **Equality filters** (most selective first) 2. **Additional equality filters** (by selectivity) 3. **Range filters** (after all equality) 4. **ORDER BY columns** (matching sort direction) 5. **GROUP BY columns** (if not already covered) ### Common Mistakes **❌ Wrong order:** ```sql CREATE INDEX ON orders (created_at, customer_id); -- Query: WHERE customer_id = ? ORDER BY created_at -- Inefficient: Large scan on created_at first ``` **✅ Correct order:** ```sql CREATE INDEX ON orders (customer_id, created_at); -- Query: WHERE customer_id = ? ORDER BY created_at -- Efficient: Narrow by customer_id, then sorted by created_at ``` ### Index Size Considerations **Each additional column:** - Increases index size by column width - Slows down write operations slightly - Enables more query patterns **Balance:** - 2-4 columns typical - 5+ columns rare (diminishing returns) ``` ### references/anti-patterns.md ```markdown # SQL Anti-Patterns Common SQL performance anti-patterns with explanations, impact analysis, and solutions. ## Table of Contents 1. 
[SELECT * Anti-Pattern](#select--anti-pattern) 2. [N+1 Query Problem](#n1-query-problem) 3. [Missing Indexes on Foreign Keys](#missing-indexes-on-foreign-keys) 4. [Non-Sargable Queries](#non-sargable-queries) 5. [Implicit Type Conversion](#implicit-type-conversion) 6. [Correlated Subqueries](#correlated-subqueries) 7. [Unnecessary DISTINCT](#unnecessary-distinct) 8. [OR vs IN Performance](#or-vs-in-performance) 9. [NOT IN with NULL Values](#not-in-with-null-values) 10. [Wildcard at Start of LIKE](#wildcard-at-start-of-like) ## SELECT * Anti-Pattern ### Problem Description Fetching all columns when only subset needed. **Anti-Pattern:** ```sql -- ❌ Bad: Fetches all 50 columns SELECT * FROM users WHERE id = 1; ``` **Impact:** - Increased I/O (reading unnecessary data from disk) - Higher network transfer (sending unnecessary data) - More memory usage (larger result sets) - Slower query execution - Breaks application when schema changes **Solution:** ```sql -- ✅ Good: Fetch only needed columns SELECT id, name, email, created_at FROM users WHERE id = 1; ``` ### When SELECT * is Acceptable **Exception 1: Small tables with few columns** ```sql SELECT * FROM settings; -- OK: 3-5 columns, small table ``` **Exception 2: Exploratory queries (development only)** ```sql -- OK during development/debugging SELECT * FROM users LIMIT 5; ``` **Exception 3: All columns genuinely needed** ```sql -- OK if truly need every column SELECT * FROM user_profiles WHERE user_id = 123; ``` ### Performance Comparison **Test Case:** Users table with 50 columns, 100,000 rows ```sql -- SELECT * : 250ms, 500MB transferred SELECT * FROM users WHERE status = 'active'; -- Specific columns: 50ms, 50MB transferred SELECT id, name, email FROM users WHERE status = 'active'; ``` **Result:** 5x performance improvement ## N+1 Query Problem ### Problem Description Executing 1 query to fetch parent records, then N queries (one per parent) to fetch related records. 
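The round-trip arithmetic behind this anti-pattern can be demonstrated with Python's bundled sqlite3 module standing in for a networked database (schema and row counts here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
""")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])
conn.executemany("INSERT INTO posts (user_id, title) VALUES (?, ?)",
                 [(i, f"post{i}") for i in range(1, 101)])

# N+1: one query for the parents, then one query per parent
queries = 1
users = conn.execute("SELECT id FROM users").fetchall()
for (uid,) in users:
    conn.execute("SELECT * FROM posts WHERE user_id = ?", (uid,)).fetchall()
    queries += 1
print(queries)  # 101

# Batched: two queries total, regardless of how many parents there are
ids = [uid for (uid,) in users]
placeholders = ",".join("?" * len(ids))
posts = conn.execute(
    f"SELECT * FROM posts WHERE user_id IN ({placeholders})", ids).fetchall()
print(2, len(posts))  # 2 100
```

The loop issues 101 statements for 100 users; the batched version stays at 2 no matter how many users there are, which is why the cost grows linearly with N only in the first form.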
**Anti-Pattern:** ```sql -- ❌ Bad: 1 + N queries -- Query 1: Fetch users SELECT * FROM users LIMIT 100; -- Query 2-101: For each user, fetch posts (executed 100 times in application loop) SELECT * FROM posts WHERE user_id = ?; ``` **Impact:** - 101 database round trips instead of 1 - Network latency multiplied by N - Database connection overhead × N - Poor scalability (linear growth with N) **Solution 1: Single JOIN** ```sql -- ✅ Good: Single query with JOIN SELECT users.id, users.name, posts.id AS post_id, posts.title, posts.content FROM users LEFT JOIN posts ON users.id = posts.user_id WHERE users.id IN (1, 2, 3, ...); ``` **Solution 2: Separate Queries with IN Clause** ```sql -- ✅ Also good: 2 queries instead of N+1 -- Query 1: Fetch users SELECT * FROM users LIMIT 100; -- Query 2: Fetch all posts for these users SELECT * FROM posts WHERE user_id IN (1, 2, 3, ..., 100); ``` ### N+1 Detection **Rails ActiveRecord Example:** ```ruby # ❌ N+1 Problem users = User.limit(100) users.each do |user| puts user.posts.count # Triggers N queries end # ✅ Fixed with eager loading users = User.includes(:posts).limit(100) users.each do |user| puts user.posts.size # Uses preloaded data (count would still issue a query) end ``` **Django ORM Example:** ```python # ❌ N+1 Problem users = User.objects.all()[:100] for user in users: print(user.posts.count()) # Triggers N queries # ✅ Fixed with prefetch_related users = User.objects.prefetch_related('posts').all()[:100] for user in users: print(len(user.posts.all())) # Uses the prefetched cache (.count() would still query) ``` ### Performance Comparison **Test Case:** 100 users, average 5 posts each ```sql -- N+1 approach: 101 queries, ~1000ms -- JOIN approach: 1 query, ~50ms ``` **Result:** 20x performance improvement ## Missing Indexes on Foreign Keys ### Problem Description Foreign key columns without indexes cause slow joins and cascading operations. **Anti-Pattern:** ```sql -- ❌ Bad: No index on foreign key CREATE TABLE orders ( id INT PRIMARY KEY, customer_id INT, -- No index!
total DECIMAL(10,2), FOREIGN KEY (customer_id) REFERENCES customers(id) ); ``` **Impact:** - Slow joins (sequential scan on orders table) - Slow cascading deletes (must scan entire orders table) - Slow cascading updates - Poor performance for queries filtering by customer_id **Solution:** ```sql -- ✅ Good: Index on foreign key CREATE TABLE orders ( id INT PRIMARY KEY, customer_id INT, total DECIMAL(10,2), FOREIGN KEY (customer_id) REFERENCES customers(id) ); CREATE INDEX idx_orders_customer ON orders (customer_id); ``` ### Why This Matters **Query Without Index:** ```sql EXPLAIN SELECT * FROM orders WHERE customer_id = 123; ``` ``` Seq Scan on orders (cost=0.00..10000.00 rows=100) Filter: (customer_id = 123) ``` **100,000 rows scanned** **Query With Index:** ```sql EXPLAIN SELECT * FROM orders WHERE customer_id = 123; ``` ``` Index Scan using idx_orders_customer on orders (cost=0.42..12.44 rows=100) Index Cond: (customer_id = 123) ``` **100 rows accessed directly** ### Cascading Delete Performance **Without Index:** ```sql DELETE FROM customers WHERE id = 123; -- Scans entire orders table to find matching rows -- Time: 500ms for 1 million row orders table ``` **With Index:** ```sql DELETE FROM customers WHERE id = 123; -- Uses index to find matching rows directly -- Time: 5ms ``` **Result:** 100x faster cascading deletes ## Non-Sargable Queries ### Problem Description **Sargable:** Search ARGument ABLE - conditions that can use indexes. **Non-sargable:** Functions or operations on indexed columns prevent index usage. 
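Whether a predicate is sargable can be read directly off a query plan; a sqlite3 sketch (SQLite's plan wording differs from PostgreSQL/MySQL/SQL Server, but the scan-versus-index-search distinction is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT);
    CREATE INDEX idx_orders_created ON orders (created_at);
""")

def plan(sql):
    # Join the "detail" column of each EXPLAIN QUERY PLAN row
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Non-sargable: the function call hides the indexed column from the planner
p1 = plan("SELECT * FROM orders WHERE strftime('%Y', created_at) = '2025'")
# Sargable: bare column compared against a range
p2 = plan("SELECT * FROM orders WHERE created_at >= '2025-01-01'"
          " AND created_at < '2026-01-01'")
print(p1)  # reports a full scan of orders
print(p2)  # reports a search using idx_orders_created
```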
### Anti-Pattern 1: Function on Indexed Column **Anti-Pattern:** ```sql -- ❌ Bad: Function prevents index usage SELECT * FROM orders WHERE YEAR(created_at) = 2025; ``` **Why It Fails:** - Index on `created_at` cannot be used - Database must evaluate YEAR() for every row - Results in sequential scan **Solution:** ```sql -- ✅ Good: Sargable range condition SELECT * FROM orders WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'; ``` ### Anti-Pattern 2: Arithmetic on Indexed Column **Anti-Pattern:** ```sql -- ❌ Bad: Arithmetic prevents index usage SELECT * FROM products WHERE price * 1.1 > 100; ``` **Solution:** ```sql -- ✅ Good: Move arithmetic to other side SELECT * FROM products WHERE price > 100 / 1.1; ``` ### Anti-Pattern 3: String Concatenation **Anti-Pattern:** ```sql -- ❌ Bad: Concatenation prevents index usage SELECT * FROM users WHERE first_name || ' ' || last_name = 'John Doe'; ``` **Solution 1: Separate Conditions** ```sql -- ✅ Better: Use separate conditions SELECT * FROM users WHERE first_name = 'John' AND last_name = 'Doe'; ``` **Solution 2: Computed Column (SQL Server, MySQL)** ```sql -- SQL Server: computed column (the + concatenation is SQL Server syntax) ALTER TABLE users ADD full_name AS (first_name + ' ' + last_name); -- MySQL: generated column equivalent ALTER TABLE users ADD full_name VARCHAR(101) GENERATED ALWAYS AS (CONCAT(first_name, ' ', last_name)) STORED; CREATE INDEX idx_users_full_name ON users (full_name); ``` ### Anti-Pattern 4: LIKE with Leading Wildcard **Anti-Pattern:** ```sql -- ❌ Bad: Leading wildcard prevents index usage SELECT * FROM users WHERE email LIKE '%@example.com'; ``` **Why It Fails:** - Cannot use B-tree index for prefix search - Must scan entire table **Solution 1: Trailing Wildcard (if applicable)** ```sql -- ✅ Good: Trailing wildcard can use index SELECT * FROM users WHERE email LIKE 'john%'; ``` **Solution 2: Full-Text Index** ```sql -- PostgreSQL: Use trigram index (requires the pg_trgm extension) CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX idx_users_email_trgm ON users USING GIN (email gin_trgm_ops); SELECT * FROM users WHERE email LIKE '%@example.com'; -- MySQL: Full-text index CREATE FULLTEXT INDEX idx_users_email_fulltext ON users (email); SELECT
* FROM users WHERE MATCH(email) AGAINST('example.com'); ``` ### Sargable vs Non-Sargable Examples | Non-Sargable (Bad) | Sargable (Good) | |-------------------|----------------| | `WHERE YEAR(date) = 2025` | `WHERE date >= '2025-01-01' AND date < '2026-01-01'` | | `WHERE LOWER(email) = '[email protected]'` | Use expression index on `LOWER(email)` | | `WHERE price * 1.1 > 100` | `WHERE price > 100 / 1.1` | | `WHERE column + 10 = 50` | `WHERE column = 40` | | `WHERE email LIKE '%@example.com'` | `WHERE email LIKE 'john%'` or use full-text | ## Implicit Type Conversion ### Problem Description Comparing different data types forces type conversion, preventing index usage. **Anti-Pattern:** ```sql -- ❌ Bad: user_id is INT, '123' is VARCHAR SELECT * FROM users WHERE user_id = '123'; ``` **Why It Fails:** - Database converts every row's user_id to string - Index on user_id cannot be used efficiently - Sequential scan likely **EXPLAIN Output:** ``` Seq Scan on users (cost=0.00..1500.00 rows=1) Filter: ((user_id)::text = '123'::text) ``` **Solution:** ```sql -- ✅ Good: Matching types SELECT * FROM users WHERE user_id = 123; ``` **EXPLAIN Output:** ``` Index Scan using idx_users_id on users (cost=0.42..8.44 rows=1) Index Cond: (user_id = 123) ``` ### Common Type Mismatch Scenarios **Scenario 1: String to Number** ```sql -- ❌ Bad SELECT * FROM orders WHERE order_id = '12345'; -- order_id is INT -- ✅ Good SELECT * FROM orders WHERE order_id = 12345; ``` **Scenario 2: Date String Comparison** ```sql -- ❌ Bad: String comparison on DATE column SELECT * FROM events WHERE event_date = '2025-01-01'; -- Implicit conversion -- ✅ Good: Explicit DATE type SELECT * FROM events WHERE event_date = DATE '2025-01-01'; ``` **Scenario 3: UUID Format** ```sql -- PostgreSQL: UUID column -- ❌ Bad SELECT * FROM records WHERE uuid_column = 'a1b2c3d4-e5f6-7890-abcd-ef1234567890'; -- ✅ Good SELECT * FROM records WHERE uuid_column = 'a1b2c3d4-e5f6-7890-abcd-ef1234567890'::uuid; ``` ## Correlated 
Subqueries ### Problem Description Subquery executes once per row in outer query, resulting in poor performance. **Anti-Pattern:** ```sql -- ❌ Bad: Correlated subquery SELECT name, (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) AS order_count FROM users; ``` **Why It Fails:** - Subquery executes once per user - For 10,000 users → 10,000 subquery executions - Even with indexes, overhead is significant **Solution 1: JOIN with GROUP BY** ```sql -- ✅ Good: Single query with JOIN SELECT users.name, COUNT(orders.id) AS order_count FROM users LEFT JOIN orders ON users.id = orders.user_id GROUP BY users.id, users.name; ``` **Solution 2: Lateral Join (PostgreSQL)** ```sql -- ✅ Good: Lateral join for complex subqueries SELECT users.name, recent_orders.order_count FROM users LEFT JOIN LATERAL ( SELECT COUNT(*) AS order_count FROM orders WHERE orders.user_id = users.id AND orders.created_at > NOW() - INTERVAL '30 days' ) recent_orders ON true; ``` ### Performance Comparison **Test Case:** 10,000 users, 100,000 orders ```sql -- Correlated subquery: ~5000ms -- JOIN with GROUP BY: ~100ms ``` **Result:** 50x performance improvement ## Unnecessary DISTINCT ### Problem Description Using DISTINCT when data is already unique adds expensive deduplication step. 
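As a sanity check, a sqlite3 sketch (illustrative table) confirms that DISTINCT on an already-unique column returns exactly the same rows and only adds a deduplication step:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(i, "active" if i % 2 else "inactive") for i in range(1, 11)])

plain = conn.execute("SELECT id FROM users WHERE status = 'active'").fetchall()
deduped = conn.execute(
    "SELECT DISTINCT id FROM users WHERE status = 'active'").fetchall()
# id is the primary key, so DISTINCT cannot remove anything
print(sorted(plain) == sorted(deduped), len(plain))  # True 5
```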
**Anti-Pattern:** ```sql -- ❌ Bad: DISTINCT on primary key (already unique) SELECT DISTINCT id FROM users WHERE status = 'active'; ``` **Why It Fails:** - DISTINCT requires sorting or hashing - Unnecessary overhead when uniqueness guaranteed - Wasted CPU and memory **Solution:** ```sql -- ✅ Good: Remove DISTINCT (id is unique) SELECT id FROM users WHERE status = 'active'; ``` ### When DISTINCT is Necessary **Necessary Use Case:** ```sql -- ✅ Good: DISTINCT needed for duplicate customer_ids SELECT DISTINCT customer_id FROM orders WHERE status = 'completed'; ``` **Alternative (often better):** ```sql -- ✅ Better: Use GROUP BY (can add aggregations) SELECT customer_id FROM orders WHERE status = 'completed' GROUP BY customer_id; ``` ### DISTINCT in JOINs **Anti-Pattern:** ```sql -- ❌ Bad: DISTINCT to fix JOIN duplication SELECT DISTINCT users.id, users.name FROM users INNER JOIN orders ON users.id = orders.user_id; ``` **Why It's Wrong:** - DISTINCT is band-aid for incorrect JOIN - Expensive deduplication **Solution:** ```sql -- ✅ Good: Use EXISTS or LEFT JOIN properly SELECT id, name FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id); -- Or if you need aggregation: SELECT users.id, users.name, COUNT(orders.id) AS order_count FROM users INNER JOIN orders ON users.id = orders.user_id GROUP BY users.id, users.name; ``` ## OR vs IN Performance ### Problem Description Multiple OR conditions can be slower than IN clause or UNION ALL. 
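The two forms are semantically identical, which a quick sqlite3 sketch (hypothetical users table) verifies before any performance tuning:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT)")
statuses = ["active", "pending", "verified", "trial", "banned"]
conn.executemany("INSERT INTO users (status) VALUES (?)",
                 [(statuses[i % 5],) for i in range(20)])

with_or = conn.execute(
    "SELECT id FROM users WHERE status = 'active' OR status = 'pending'"
    " OR status = 'verified' OR status = 'trial'").fetchall()
with_in = conn.execute(
    "SELECT id FROM users WHERE status IN ('active', 'pending', 'verified', 'trial')"
).fetchall()
# Same rows either way; IN is just the simpler predicate for the planner
print(sorted(with_or) == sorted(with_in), len(with_in))  # True 16
```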
**Anti-Pattern:** ```sql -- ❌ Suboptimal: Multiple OR conditions SELECT * FROM users WHERE status = 'active' OR status = 'pending' OR status = 'verified' OR status = 'trial'; ``` **Why It's Suboptimal:** - May not use index efficiently - Query planner may choose sequential scan **Solution 1: Use IN** ```sql -- ✅ Better: IN clause SELECT * FROM users WHERE status IN ('active', 'pending', 'verified', 'trial'); ``` **Solution 2: Use UNION ALL (if separate indexes exist)** ```sql -- ✅ Alternative: UNION ALL with separate index scans -- (safe here: a row has exactly one status, so the branches cannot overlap) SELECT * FROM users WHERE status = 'active' UNION ALL SELECT * FROM users WHERE status = 'pending' UNION ALL SELECT * FROM users WHERE status = 'verified' UNION ALL SELECT * FROM users WHERE status = 'trial'; ``` ### OR on Different Columns **Anti-Pattern:** ```sql -- ❌ Bad: OR on different columns SELECT * FROM users WHERE email = '[email protected]' OR phone = '555-1234'; ``` **Why It Fails:** - Cannot use indexes effectively - Often results in sequential scan **Solution: UNION** ```sql -- ✅ Good: UNION allows index usage on both columns -- (UNION, not UNION ALL: a row matching both conditions would otherwise appear twice) SELECT * FROM users WHERE email = '[email protected]' UNION SELECT * FROM users WHERE phone = '555-1234'; ``` ## NOT IN with NULL Values ### Problem Description NOT IN with NULL values returns unexpected results (empty set). **Anti-Pattern:** ```sql -- ❌ Bad: NOT IN with potential NULLs SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM deleted_users); -- Returns ZERO rows if any user_id is NULL!
``` **Why It Fails:** - SQL three-valued logic (TRUE, FALSE, NULL) - NULL in list makes entire NOT IN return NULL - NULL is not TRUE, so row is excluded **Solution 1: NOT EXISTS** ```sql -- ✅ Good: NOT EXISTS handles NULLs correctly SELECT * FROM users WHERE NOT EXISTS ( SELECT 1 FROM deleted_users WHERE deleted_users.user_id = users.id ); ``` **Solution 2: Filter NULLs in Subquery** ```sql -- ✅ Also good: Explicitly exclude NULLs SELECT * FROM users WHERE id NOT IN ( SELECT user_id FROM deleted_users WHERE user_id IS NOT NULL ); ``` ### Performance Comparison ```sql -- NOT IN: May be slower, fails with NULLs -- NOT EXISTS: Usually faster, handles NULLs correctly ``` ## Wildcard at Start of LIKE ### Problem Description Leading wildcard in LIKE prevents index usage. **Anti-Pattern:** ```sql -- ❌ Bad: Leading wildcard SELECT * FROM products WHERE name LIKE '%widget%'; ``` **Why It Fails:** - B-tree index cannot be used - Must scan entire table - Evaluate LIKE for every row **Solution 1: Trailing Wildcard (if applicable)** ```sql -- ✅ Good: Trailing wildcard can use index SELECT * FROM products WHERE name LIKE 'super%'; ``` **Solution 2: Full-Text Search** ```sql -- PostgreSQL: Full-text search with GIN index CREATE INDEX idx_products_name_fts ON products USING GIN (to_tsvector('english', name)); SELECT * FROM products WHERE to_tsvector('english', name) @@ to_tsquery('english', 'widget'); -- MySQL: Full-text index CREATE FULLTEXT INDEX idx_products_name_fulltext ON products (name); SELECT * FROM products WHERE MATCH(name) AGAINST('widget' IN NATURAL LANGUAGE MODE); ``` **Solution 3: Trigram Index (PostgreSQL)** ```sql -- PostgreSQL: Trigram index for LIKE patterns CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX idx_products_name_trgm ON products USING GIN (name gin_trgm_ops); -- Now LIKE '%widget%' can use index SELECT * FROM products WHERE name LIKE '%widget%'; ``` ## Quick Reference: Anti-Patterns Summary | Anti-Pattern | Impact | Solution | 
|-------------|--------|----------| | SELECT * | Over-fetching, wasted I/O | SELECT specific columns | | N+1 queries | Network latency × N | JOIN or IN clause | | Missing FK indexes | Slow joins, cascades | Index all foreign keys | | Functions on columns | No index usage | Sargable conditions or expression index | | Type mismatch | Implicit conversion | Match data types | | Correlated subquery | Subquery per row | JOIN with GROUP BY | | Unnecessary DISTINCT | Expensive dedup | Remove if uniqueness guaranteed | | Multiple ORs | Poor index usage | IN clause or UNION ALL | | NOT IN with NULLs | Unexpected results | NOT EXISTS | | Leading wildcard LIKE | Full table scan | Full-text or trigram index | ## Anti-Pattern Detection Queries ### PostgreSQL: Find SELECT * Queries ```sql -- Enable query logging ALTER DATABASE yourdb SET log_statement = 'all'; -- Analyze logs for SELECT * patterns -- (requires log analysis tool or grep on log files) ``` ### PostgreSQL: Find Missing Foreign Key Indexes ```sql SELECT c.conrelid::regclass AS table_name, a.attname AS column_name, c.conname AS constraint_name FROM pg_constraint c JOIN pg_attribute a ON a.attnum = ANY(c.conkey) AND a.attrelid = c.conrelid WHERE c.contype = 'f' AND NOT EXISTS ( SELECT 1 FROM pg_index i WHERE i.indrelid = c.conrelid AND a.attnum = ANY(i.indkey) ); ``` ### SQL Server: Find Implicit Conversions ```sql -- Check execution plans for warnings -- Look for "CONVERT_IMPLICIT" warnings in graphical plans ``` ``` ### references/efficient-patterns.md ```markdown # Efficient Query Patterns Collection of proven efficient SQL patterns for optimal query performance. ## Table of Contents 1. [Existence Checks](#existence-checks) 2. [Pagination](#pagination) 3. [Aggregation Patterns](#aggregation-patterns) 4. [Union Operations](#union-operations) 5. [Window Functions](#window-functions) 6. [Batch Operations](#batch-operations) ## Existence Checks ### Pattern: EXISTS vs COUNT **Use Case:** Check if related records exist. 
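For reference, the EXISTS form of this check can be exercised end-to-end with sqlite3 (sample data; the semi-join returns each user at most once):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "ada"), (2, "bob"), (3, "cam")])
conn.executemany("INSERT INTO orders (user_id) VALUES (?)", [(1,), (1,), (3,)])

# Semi-join: each user appears at most once, no matter how many orders match
rows = conn.execute("""
    SELECT name FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)
    ORDER BY name
""").fetchall()
print(rows)  # [('ada',), ('cam',)]
```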
**Inefficient:** ```sql -- ❌ Bad: Counts all matching rows SELECT * FROM users WHERE (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id) > 0; ``` **Efficient:** ```sql -- ✅ Good: Stops at first match SELECT * FROM users WHERE EXISTS (SELECT 1 FROM orders WHERE orders.user_id = users.id); ``` **Why EXISTS is Better:** - Stops execution at first matching row - No need to count all rows - Better performance for large result sets **Performance Comparison:** ``` COUNT(*) with 1000 matches: ~100ms EXISTS with 1000 matches: ~5ms (stops at first match) ``` ### Pattern: EXISTS vs IN **Use Case:** Filter by values in subquery. **Less Efficient:** ```sql -- ❌ Suboptimal: Builds full list SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 1000); ``` **More Efficient:** ```sql -- ✅ Better: Can stop early with semi-join SELECT * FROM users WHERE EXISTS ( SELECT 1 FROM orders WHERE orders.user_id = users.id AND orders.total > 1000 ); ``` **When to Use Each:** - **EXISTS**: When checking existence or correlated conditions - **IN**: When list is small and static (e.g., `IN (1, 2, 3)`) ## Pagination ### Pattern: Efficient LIMIT/OFFSET **Inefficient for Deep Pagination:** ```sql -- ❌ Bad: Offset skips rows but database still processes them SELECT * FROM posts ORDER BY created_at DESC LIMIT 20 OFFSET 10000; -- Skip 10,000 rows ``` **Why It's Slow:** - Database processes all 10,020 rows - Sorts 10,020 rows - Then discards first 10,000 - Performance degrades linearly with offset **Efficient Keyset Pagination:** ```sql -- ✅ Good: Use last seen value as cursor SELECT * FROM posts WHERE created_at < '2025-01-15 10:30:00' -- Last seen timestamp ORDER BY created_at DESC LIMIT 20; ``` **Why It's Fast:** - Index scan starts at cursor position - No offset processing - Constant performance regardless of page depth **Implementation Pattern:** ```sql -- Page 1 SELECT id, title, created_at FROM posts ORDER BY created_at DESC, id DESC LIMIT 20; -- Page 2 
(last_created_at = '2025-01-15 10:30:00', last_id = 12345) SELECT id, title, created_at FROM posts WHERE (created_at < '2025-01-15 10:30:00') OR (created_at = '2025-01-15 10:30:00' AND id < 12345) ORDER BY created_at DESC, id DESC LIMIT 20; ``` **Required Index:** ```sql CREATE INDEX idx_posts_created_id ON posts (created_at DESC, id DESC); ``` ### Pattern: Cursor-Based Pagination **API Response Format:** ```json { "data": [...], "cursor": { "next": "eyJjcmVhdGVkX2F0IjoiMjAyNS0wMS0xNSIsImlkIjoxMjM0NX0=", "has_more": true } } ``` **Cursor Encoding:** ```python import base64 import json # Encode cursor cursor_data = {"created_at": "2025-01-15 10:30:00", "id": 12345} cursor = base64.b64encode(json.dumps(cursor_data).encode()).decode() # Decode cursor cursor_data = json.loads(base64.b64decode(cursor).decode()) ``` ## Aggregation Patterns ### Pattern: Conditional Aggregation **Use Case:** Multiple aggregations with different conditions. **Inefficient (Multiple Subqueries):** ```sql -- ❌ Bad: Multiple scans of orders table SELECT c.id, c.name, (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) AS total_orders, (SELECT COUNT(*) FROM orders WHERE customer_id = c.id AND status = 'completed') AS completed_orders, (SELECT SUM(total) FROM orders WHERE customer_id = c.id AND status = 'completed') AS revenue FROM customers c; ``` **Efficient (Single Query with CASE):** ```sql -- ✅ Good: Single scan with conditional aggregation SELECT c.id, c.name, COUNT(o.id) AS total_orders, COUNT(CASE WHEN o.status = 'completed' THEN 1 END) AS completed_orders, SUM(CASE WHEN o.status = 'completed' THEN o.total ELSE 0 END) AS revenue FROM customers c LEFT JOIN orders o ON c.id = o.customer_id GROUP BY c.id, c.name; ``` **Performance Comparison:** ``` Multiple subqueries: 3 table scans, ~300ms Conditional aggregation: 1 table scan, ~100ms ``` ### Pattern: Filtered Aggregation (PostgreSQL) **PostgreSQL FILTER Clause:** ```sql -- ✅ PostgreSQL-specific: FILTER clause (more readable) SELECT 
c.id, c.name, COUNT(o.id) AS total_orders, COUNT(o.id) FILTER (WHERE o.status = 'completed') AS completed_orders, SUM(o.total) FILTER (WHERE o.status = 'completed') AS revenue FROM customers c LEFT JOIN orders o ON c.id = o.customer_id GROUP BY c.id, c.name; ``` ## Union Operations ### Pattern: UNION ALL vs UNION **Inefficient:** ```sql -- ❌ Bad: UNION removes duplicates (expensive) SELECT id, name FROM active_users UNION SELECT id, name FROM trial_users; ``` **Why UNION is Expensive:** - Sorts both result sets - Performs deduplication - Additional memory and CPU overhead **Efficient:** ```sql -- ✅ Good: UNION ALL (no deduplication) SELECT id, name FROM active_users UNION ALL SELECT id, name FROM trial_users; ``` **When to Use Each:** - **UNION ALL**: When duplicates acceptable or datasets don't overlap (99% of cases) - **UNION**: Only when duplicates must be removed and datasets may overlap **Performance Comparison:** ``` UNION with 100k rows: ~500ms (sorting + dedup) UNION ALL with 100k rows: ~50ms (no overhead) ``` ### Pattern: Efficient Set Operations **Use Case:** Combine results from partitioned tables. ```sql -- ✅ Good: UNION ALL for partitioned data SELECT * FROM orders_2024 WHERE status = 'pending' UNION ALL SELECT * FROM orders_2025 WHERE status = 'pending'; ``` ## Window Functions ### Pattern: Row Numbering for Deduplication **Use Case:** Get first/last record per group. 
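The ROW_NUMBER pattern for this use case runs unchanged on SQLite 3.25+ (bundled with recent Python), so the "newest row per group" result can be verified locally on sample data:

```python
import sqlite3  # window functions require SQLite 3.25+

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products"
             " (id INTEGER PRIMARY KEY, category_id INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    (1, 1, "2025-01-01"), (2, 1, "2025-02-01"),
    (3, 2, "2025-01-15"), (4, 2, "2025-03-01"),
])

# Newest product per category: rank rows inside each partition, keep rank 1
rows = conn.execute("""
    WITH ranked AS (
        SELECT id, category_id,
               ROW_NUMBER() OVER (PARTITION BY category_id
                                  ORDER BY created_at DESC) AS rn
        FROM products
    )
    SELECT category_id, id FROM ranked WHERE rn = 1 ORDER BY category_id
""").fetchall()
print(rows)  # [(1, 2), (2, 4)]
```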
**Inefficient (Correlated Subquery):** ```sql -- ❌ Bad: Correlated subquery SELECT * FROM products p WHERE p.created_at = ( SELECT MAX(created_at) FROM products WHERE category_id = p.category_id ); ``` **Efficient (Window Function):** ```sql -- ✅ Good: Window function with CTE WITH ranked_products AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY created_at DESC) AS rn FROM products ) SELECT * FROM ranked_products WHERE rn = 1; ``` **Performance Comparison:** ``` Correlated subquery: ~2000ms (N subqueries) Window function: ~200ms (single scan) ``` ### Pattern: Running Totals **Use Case:** Calculate cumulative sum. **Inefficient (Correlated Subquery):** ```sql -- ❌ Bad: Subquery per row SELECT order_date, total, (SELECT SUM(total) FROM orders o2 WHERE o2.order_date <= orders.order_date) AS running_total FROM orders; ``` **Efficient (Window Function):** ```sql -- ✅ Good: Window function SELECT order_date, total, SUM(total) OVER (ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total FROM orders; ``` ### Pattern: Rank Within Group **Use Case:** Find top N items per category. ```sql -- ✅ Efficient: Top 3 products per category by sales WITH ranked_products AS ( SELECT category_id, product_name, sales, RANK() OVER (PARTITION BY category_id ORDER BY sales DESC) AS rank FROM products ) SELECT * FROM ranked_products WHERE rank <= 3; ``` ## Batch Operations ### Pattern: Bulk INSERT **Inefficient:** ```sql -- ❌ Bad: Individual INSERTs (1000 round trips) INSERT INTO users (name, email) VALUES ('User1', '[email protected]'); INSERT INTO users (name, email) VALUES ('User2', '[email protected]'); -- ... 998 more ``` **Efficient:** ```sql -- ✅ Good: Bulk INSERT (1 round trip) INSERT INTO users (name, email) VALUES ('User1', '[email protected]'), ('User2', '[email protected]'), ('User3', '[email protected]'), -- ... 
up to 1000 rows ('User1000', '[email protected]'); ``` **Performance Comparison:** ``` Individual INSERTs: ~5000ms Bulk INSERT: ~50ms ``` **Batch Size Recommendations:** - **PostgreSQL**: 1000-5000 rows per INSERT - **MySQL**: 1000 rows per INSERT (max_allowed_packet limit) - **SQL Server**: 1000 rows per INSERT ### Pattern: Bulk UPDATE with CASE **Use Case:** Update multiple rows with different values. **Inefficient:** ```sql -- ❌ Bad: Multiple UPDATE statements UPDATE products SET price = 19.99 WHERE id = 1; UPDATE products SET price = 29.99 WHERE id = 2; UPDATE products SET price = 39.99 WHERE id = 3; ``` **Efficient:** ```sql -- ✅ Good: Single UPDATE with CASE UPDATE products SET price = CASE id WHEN 1 THEN 19.99 WHEN 2 THEN 29.99 WHEN 3 THEN 39.99 END WHERE id IN (1, 2, 3); ``` ### Pattern: Upsert (INSERT or UPDATE) **PostgreSQL (ON CONFLICT):** ```sql -- ✅ Efficient upsert INSERT INTO user_stats (user_id, login_count, last_login) VALUES (123, 1, NOW()) ON CONFLICT (user_id) DO UPDATE SET login_count = user_stats.login_count + 1, last_login = NOW(); ``` **MySQL (ON DUPLICATE KEY UPDATE):** ```sql -- ✅ Efficient upsert INSERT INTO user_stats (user_id, login_count, last_login) VALUES (123, 1, NOW()) ON DUPLICATE KEY UPDATE login_count = login_count + 1, last_login = NOW(); ``` **SQL Server (MERGE):** ```sql -- ✅ Efficient upsert MERGE INTO user_stats AS target USING (SELECT 123 AS user_id, 1 AS login_count, GETDATE() AS last_login) AS source ON target.user_id = source.user_id WHEN MATCHED THEN UPDATE SET login_count = target.login_count + 1, last_login = GETDATE() WHEN NOT MATCHED THEN INSERT (user_id, login_count, last_login) VALUES (source.user_id, source.login_count, source.last_login); ``` ## Common Table Expressions (CTEs) ### Pattern: Readable Complex Queries **Inefficient (Nested Subqueries):** ```sql -- ❌ Hard to read and maintain SELECT * FROM ( SELECT * FROM ( SELECT user_id, SUM(total) as revenue FROM orders WHERE status = 'completed' GROUP BY user_id 
) AS user_revenue WHERE revenue > 1000 ) AS high_value_users INNER JOIN users ON users.id = high_value_users.user_id; ``` **Efficient (CTEs):** ```sql -- ✅ Readable and maintainable WITH user_revenue AS ( SELECT user_id, SUM(total) as revenue FROM orders WHERE status = 'completed' GROUP BY user_id ), high_value_users AS ( SELECT * FROM user_revenue WHERE revenue > 1000 ) SELECT users.*, high_value_users.revenue FROM high_value_users INNER JOIN users ON users.id = high_value_users.user_id; ``` **Benefits:** - Better readability - Easier debugging (can SELECT from CTEs individually) - Query optimizer can optimize entire query ### Pattern: Recursive CTEs **Use Case:** Hierarchical data (org charts, nested categories). ```sql -- ✅ Recursive CTE for org chart WITH RECURSIVE org_chart AS ( -- Base case: top-level managers SELECT id, name, manager_id, 1 AS level FROM employees WHERE manager_id IS NULL UNION ALL -- Recursive case: employees reporting to previous level SELECT e.id, e.name, e.manager_id, oc.level + 1 FROM employees e INNER JOIN org_chart oc ON e.manager_id = oc.id ) SELECT * FROM org_chart ORDER BY level, name; ``` ## Index-Friendly Patterns ### Pattern: Prefix Matching **Efficient:** ```sql -- ✅ Can use B-tree index SELECT * FROM users WHERE email LIKE 'john%'; ``` **Index:** ```sql CREATE INDEX idx_users_email ON users (email); ``` ### Pattern: Range Queries **Efficient:** ```sql -- ✅ Can use B-tree index SELECT * FROM orders WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'; ``` **Index:** ```sql CREATE INDEX idx_orders_created ON orders (created_at); ``` ### Pattern: Composite Filters **Efficient:** ```sql -- ✅ Uses composite index efficiently SELECT * FROM orders WHERE customer_id = 123 AND status = 'pending' ORDER BY created_at DESC LIMIT 10; ``` **Optimal Index:** ```sql CREATE INDEX idx_orders_customer_status_created ON orders (customer_id, status, created_at DESC); ``` ## Quick Reference ### Existence Checks - Use `EXISTS` instead of 
`COUNT(*) > 0` - Use `NOT EXISTS` instead of `NOT IN` (handles NULLs) ### Pagination - Use keyset/cursor pagination instead of OFFSET for deep pagination - Index columns in ORDER BY and WHERE clauses ### Aggregation - Use conditional aggregation (CASE in aggregate) instead of multiple subqueries - PostgreSQL: Use FILTER clause for readability ### Unions - Use `UNION ALL` by default (no deduplication overhead) - Only use `UNION` when duplicates must be removed ### Window Functions - Use window functions instead of correlated subqueries for ranking/running totals - More efficient and more readable ### Batch Operations - Bulk INSERT/UPDATE instead of row-by-row operations - Use upsert operations (ON CONFLICT, ON DUPLICATE KEY, MERGE) ### CTEs - Use CTEs for complex queries (better readability) - PostgreSQL 12+: CTEs are inline-optimized by default ``` ### references/postgresql.md ```markdown # PostgreSQL-Specific Optimizations PostgreSQL-specific features, index types, and optimization techniques. ## PostgreSQL Index Types ### B-tree (Default) **Use Case:** General-purpose index for most queries. **Supported Operators:** <, ≤, =, ≥, >, BETWEEN, IN, IS NULL, IS NOT NULL **Creation:** ```sql CREATE INDEX idx_users_email ON users (email); -- Or explicitly: CREATE INDEX idx_users_email ON users USING BTREE (email); ``` ### Hash **Use Case:** Equality comparisons only (=). **Benefits:** Faster than B-tree for equality, smaller index size. **Limitations:** Only supports =, no range queries. **Creation:** ```sql CREATE INDEX idx_users_email_hash ON users USING HASH (email); ``` **When to Use:** High-frequency equality lookups only. ### GIN (Generalized Inverted Index) **Use Case:** Full-text search, JSONB, arrays, composite types. 
**Supported Types:** - JSONB documents - Arrays - Full-text search (tsvector) - hstore **Full-Text Search:** ```sql -- Create GIN index on tsvector CREATE INDEX idx_articles_content_fts ON articles USING GIN (to_tsvector('english', content)); -- Query SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('english', 'postgres & optimization'); ``` **JSONB:** ```sql -- Create GIN index on JSONB column CREATE INDEX idx_users_metadata ON users USING GIN (metadata); -- Query with containment SELECT * FROM users WHERE metadata @> '{"premium": true}'; -- Query with existence SELECT * FROM users WHERE metadata ? 'premium'; ``` **Arrays:** ```sql -- Create GIN index on array column CREATE INDEX idx_products_tags ON products USING GIN (tags); -- Query with array overlap SELECT * FROM products WHERE tags && ARRAY['electronics', 'sale']; -- Query with array contains SELECT * FROM products WHERE tags @> ARRAY['electronics']; ``` ### GiST (Generalized Search Tree) **Use Case:** Spatial data, ranges, full-text search, geometric types. **Geometric Types:** ```sql -- Create GiST index on geometry column CREATE INDEX idx_locations_point ON locations USING GIST (point); -- Nearest-neighbor query: the distance operator in ORDER BY uses the GiST index -- (a bare "point <-> ... < 10" filter in WHERE would not) SELECT * FROM locations ORDER BY point <-> point(40.7128, -74.0060) LIMIT 10; ``` **Range Types:** ```sql -- Create GiST index on date range CREATE INDEX idx_bookings_dates ON bookings USING GIST (date_range); -- Query with range overlap SELECT * FROM bookings WHERE date_range && '[2025-01-01,2025-01-31)'::daterange; ``` ### BRIN (Block Range Index) **Use Case:** Very large tables (100GB+) with naturally ordered data.
**Benefits:**
- Tiny index size (can be ~1000x smaller than B-tree)
- Very fast to build
- Minimal maintenance overhead

**Limitations:**
- Only efficient for ordered data (timestamps, sequential IDs)
- Less precise than B-tree (may scan extra blocks)

**Creation:**
```sql
-- BRIN index on timestamp column
CREATE INDEX idx_logs_created_brin ON logs USING BRIN (created_at);
```

**When to Use:**
- Tables >100GB
- Naturally ordered data (append-only logs, time-series)
- Insert-heavy workloads

**Performance:**
```
B-tree index on 1TB table: ~50GB index size
BRIN index on 1TB table:   ~50MB index size
```

## PostgreSQL-Specific Features

### Partial Indexes

**Use Case:** Index only the subset of rows matching a condition.

**Benefits:**
- Smaller index size
- Faster index scans
- Lower maintenance overhead

**Example 1: Index Active Users Only**
```sql
CREATE INDEX idx_users_active ON users (last_login)
WHERE status = 'active';
```

**Example 2: Index Recent Orders**
```sql
-- The predicate must be immutable, so NOW() is not allowed here;
-- use a fixed cutoff and recreate the index periodically to slide the window
CREATE INDEX idx_orders_recent ON orders (created_at)
WHERE created_at > '2025-01-01';
```

**When to Use:**
- Queries frequently filter on the same condition
- A minority of rows match the condition (<20%)

### Expression Indexes

**Use Case:** Index computed values or function results.

**Example 1: Case-Insensitive Search**
```sql
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- Query must use the same expression
SELECT * FROM users WHERE LOWER(email) = '[email protected]';
```

**Example 2: Date Truncation**
```sql
-- The cast is immutable for timestamp without time zone
CREATE INDEX idx_orders_created_date ON orders ((created_at::date));

-- Query for a specific date with the same expression
SELECT * FROM orders WHERE created_at::date = '2025-01-15';
```

**Example 3: JSONB Path**
```sql
CREATE INDEX idx_users_premium ON users ((metadata->>'premium'));

SELECT * FROM users WHERE metadata->>'premium' = 'true';
```

### Covering Indexes (INCLUDE Clause)

**Use Case:** Include non-indexed columns in the index for Index-Only Scans.
**Syntax:**
```sql
CREATE INDEX idx_users_email_covering ON users (email)
INCLUDE (id, name, created_at);
```

**Query Benefits:**
```sql
-- This query uses an Index-Only Scan (no heap access)
SELECT id, name, created_at
FROM users
WHERE email = '[email protected]';
```

**Performance:**
```
Without INCLUDE: Index Scan + Heap Fetch = 10ms
With INCLUDE:    Index-Only Scan        =  1ms
```

### Concurrent Index Creation

**Use Case:** Create indexes on production tables without blocking writes.

**Standard Index Creation (Locks Table):**
```sql
CREATE INDEX idx_users_email ON users (email);
-- Blocks all writes until complete
```

**Concurrent Index Creation (No Lock):**
```sql
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
-- Allows concurrent writes, takes longer
```

**Trade-offs:**
- Slower to build (multiple table scans)
- No table lock (production-safe)
- Can fail mid-build (index left behind, marked INVALID)

**Cleanup Failed Index:**
```sql
-- List invalid indexes (a failed CREATE INDEX CONCURRENTLY keeps its given name)
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;

-- Drop the invalid index, then retry
DROP INDEX CONCURRENTLY idx_users_email;
```

## PostgreSQL Execution Plan Analysis

### EXPLAIN ANALYZE with Buffers

**Basic EXPLAIN ANALYZE:**
```sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;
```

**With Buffer Statistics:**
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 123;
```

**Output:**
```
Index Scan using idx_orders_customer on orders
  (cost=0.42..12.44 rows=10 width=200)
  (actual time=0.025..0.035 rows=10 loops=1)
  Index Cond: (customer_id = 123)
  Buffers: shared hit=5
Planning Time: 0.150 ms
Execution Time: 0.050 ms
```

**Buffer Analysis:**
- **shared hit**: pages read from cache (good)
- **shared read**: pages read from disk (slow)
- **shared dirtied**: pages modified
- **shared written**: pages written to disk

**Goal:** Maximize "shared hit", minimize "shared read".
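As a complement to per-query buffer counts, the cumulative counters in `pg_statio_user_tables` give a table-level view of cache behavior. A sketch of a hit-ratio check (the `LIMIT` and ordering are illustrative):

```sql
-- Tables doing the most disk reads, with their cache hit ratio
SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       ROUND(heap_blks_hit::numeric
             / NULLIF(heap_blks_hit + heap_blks_read, 0), 3) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```

A consistently low hit ratio on a hot table suggests shared_buffers is too small or the working set does not fit in cache.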
### EXPLAIN Options

```sql
-- Verbose output (column list, table aliases)
EXPLAIN (VERBOSE) SELECT ...;

-- Show estimated costs without executing
EXPLAIN SELECT ...;

-- Execute and show actual timing
EXPLAIN ANALYZE SELECT ...;

-- Show buffer usage
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Per-node timing (on by default with ANALYZE; TIMING OFF reduces overhead)
EXPLAIN (ANALYZE, TIMING) SELECT ...;

-- Machine-readable JSON format
EXPLAIN (ANALYZE, FORMAT JSON) SELECT ...;
```

## PostgreSQL Query Optimization

### Parallel Query Execution

**Enable Parallel Queries:**
```sql
-- Check parallel workers setting
SHOW max_parallel_workers_per_gather;

-- Set parallel workers for session
SET max_parallel_workers_per_gather = 4;
```

**Parallel Seq Scan:**
```sql
EXPLAIN SELECT COUNT(*) FROM large_table;
```
```
Finalize Aggregate (cost=xxx rows=1)
  -> Gather (cost=xxx rows=4)
     Workers Planned: 4
     -> Partial Aggregate
        -> Parallel Seq Scan on large_table
```

**Force Parallel Execution:**
```sql
-- Lower the cost thresholds so parallel plans are chosen
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
```

### Common Table Expressions (CTEs)

**Inlined CTEs (Default in PostgreSQL 12+):**
```sql
-- CTE is inlined into the main query (not materialized)
WITH recent_orders AS (
  SELECT * FROM orders
  WHERE created_at > NOW() - INTERVAL '7 days'
)
SELECT * FROM recent_orders WHERE total > 100;
```

**Force Materialization:**
```sql
-- Explicitly materialize the CTE
WITH recent_orders AS MATERIALIZED (
  SELECT * FROM orders
  WHERE created_at > NOW() - INTERVAL '7 days'
)
SELECT * FROM recent_orders WHERE total > 100;
```

**When to Materialize:**
- CTE is referenced multiple times in the main query
- CTE is expensive and produces a small result set

### LATERAL Joins

**Use Case:** Correlated subqueries with better performance.
**Without LATERAL (Correlated Subquery):**
```sql
SELECT u.name,
       (SELECT COUNT(*) FROM orders WHERE orders.user_id = u.id) AS order_count
FROM users u;
```

**With LATERAL:**
```sql
SELECT u.name, o.order_count
FROM users u
LEFT JOIN LATERAL (
  SELECT COUNT(*) AS order_count
  FROM orders
  WHERE orders.user_id = u.id
) o ON true;
```

**Advanced LATERAL (Top N per Group):**
```sql
-- Get the 3 most recent orders per customer
SELECT c.name, o.order_date, o.total
FROM customers c
LEFT JOIN LATERAL (
  SELECT order_date, total
  FROM orders
  WHERE customer_id = c.id
  ORDER BY order_date DESC
  LIMIT 3
) o ON true;
```

## PostgreSQL Configuration Tuning

### Memory Settings

**shared_buffers:**
```sql
-- 25% of system RAM (typical recommendation); requires a server restart
ALTER SYSTEM SET shared_buffers = '4GB';
```

**work_mem:**
```sql
-- Memory per sort/hash operation
-- Set per session for complex queries
SET work_mem = '256MB';
```

**effective_cache_size:**
```sql
-- Planner's estimate of the OS file cache
-- 50-75% of system RAM
ALTER SYSTEM SET effective_cache_size = '12GB';
```

### Query Planning

**random_page_cost:**
```sql
-- Lower for SSDs (default 4.0)
ALTER SYSTEM SET random_page_cost = 1.1;
```

**effective_io_concurrency:**
```sql
-- Number of concurrent I/O operations
-- Higher for SSDs (default 1)
ALTER SYSTEM SET effective_io_concurrency = 200;
```

### Statistics

**default_statistics_target:**
```sql
-- Increase for better query planning (default 100)
ALTER SYSTEM SET default_statistics_target = 500;

-- Per-column statistics
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
```

## PostgreSQL Monitoring

### Find Slow Queries

**Enable slow query logging:**
```sql
-- Log queries slower than 100ms
ALTER SYSTEM SET log_min_duration_statement = 100;
SELECT pg_reload_conf();
```

### Index Usage Statistics

**Find unused indexes:**
```sql
SELECT schemaname, tablename, indexname, idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexname
NOT LIKE 'pg_toast_%'
ORDER BY pg_relation_size(indexrelid) DESC;
```

**Find most used indexes:**
```sql
SELECT schemaname, tablename, indexname,
       idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20;
```

### Table Statistics

**Update statistics:**
```sql
-- Analyze entire database
ANALYZE;

-- Analyze specific table
ANALYZE orders;

-- Verbose output
ANALYZE VERBOSE orders;
```

**View table statistics:**
```sql
SELECT schemaname, tablename,
       n_live_tup AS live_rows,
       n_dead_tup AS dead_rows,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
```

## Quick Reference

### Index Type Selection

| Use Case | Index Type | Creation |
|----------|-----------|----------|
| General queries | B-tree | `CREATE INDEX ON table (column)` |
| Equality only | Hash | `CREATE INDEX ON table USING HASH (column)` |
| Full-text search | GIN | `CREATE INDEX ON table USING GIN (to_tsvector('english', column))` |
| JSONB | GIN | `CREATE INDEX ON table USING GIN (jsonb_column)` |
| Arrays | GIN | `CREATE INDEX ON table USING GIN (array_column)` |
| Spatial data | GiST | `CREATE INDEX ON table USING GIST (geometry_column)` |
| Large ordered tables | BRIN | `CREATE INDEX ON table USING BRIN (timestamp_column)` |

### PostgreSQL-Specific Optimizations

- Use **partial indexes** for frequently-filtered subsets
- Use **expression indexes** for computed values
- Use the **INCLUDE** clause for covering indexes
- Use **LATERAL** for efficient correlated subqueries
- Use **BRIN** for very large time-series tables
- Create indexes **CONCURRENTLY** on production tables
```

### references/mysql.md

```markdown
# MySQL-Specific Optimizations

MySQL-specific features, storage engines, and optimization techniques.
## MySQL Storage Engines

### InnoDB (Default)

**Characteristics:**
- ACID-compliant transactions
- Row-level locking
- Crash recovery
- Foreign key support
- Clustered primary key

**Use Case:** Default choice for most applications.

**Creation:**
```sql
CREATE TABLE orders (
  id INT PRIMARY KEY,
  customer_id INT,
  total DECIMAL(10,2)
) ENGINE=InnoDB;
```

**Clustered Index:**
- Table data is stored in primary key order
- Fast primary key lookups
- Choose a small primary key (INT vs BIGINT vs UUID)

**Recommendation:** Use an AUTO_INCREMENT INT or BIGINT primary key.

### MyISAM

**Characteristics:**
- No transactions
- Table-level locking
- Faster reads (no MVCC overhead)
- No foreign keys
- No crash recovery

**Use Case:** Read-heavy tables with no writes (archives, logs).

**Creation:**
```sql
CREATE TABLE archive_logs (
  id INT PRIMARY KEY,
  message TEXT
) ENGINE=MyISAM;
```

**Warning:** Legacy engine; avoid for new applications.

## MySQL Index Types

### B-tree Index (Default)

**Use Case:** General-purpose index.

**Creation:**
```sql
CREATE INDEX idx_users_email ON users (email);
```

**Prefix Indexes for Long Strings:**
```sql
-- Index the first 10 characters
CREATE INDEX idx_articles_title ON articles (title(10));
```

**Composite Indexes:**
```sql
CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);
```

### Full-Text Index

**Use Case:** Text search on VARCHAR/TEXT columns.
**Creation:**
```sql
-- Add full-text index
CREATE FULLTEXT INDEX idx_articles_content ON articles (content);

-- Or in the table definition
CREATE TABLE articles (
  id INT PRIMARY KEY,
  title VARCHAR(255),
  content TEXT,
  FULLTEXT (title, content)
) ENGINE=InnoDB;
```

**Query Syntax:**
```sql
-- Natural language search
SELECT * FROM articles
WHERE MATCH(content) AGAINST('mysql optimization');

-- Boolean mode (AND/OR/NOT)
SELECT * FROM articles
WHERE MATCH(content) AGAINST('+mysql +optimization -postgres' IN BOOLEAN MODE);

-- Query expansion (finds related terms)
SELECT * FROM articles
WHERE MATCH(content) AGAINST('database' WITH QUERY EXPANSION);
```

### Spatial Index

**Use Case:** Geometric data (points, polygons).

**Creation:**
```sql
CREATE TABLE locations (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  coordinates POINT NOT NULL,
  SPATIAL INDEX(coordinates)
) ENGINE=InnoDB;
```

**Query:**
```sql
-- Find locations within a bounding box
SELECT name FROM locations
WHERE MBRContains(
  ST_GeomFromText('POLYGON((0 0, 10 0, 10 10, 0 10, 0 0))'),
  coordinates
);
```

## MySQL EXPLAIN Analysis

### EXPLAIN Output Format

**Basic EXPLAIN:**
```sql
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
```

**Output:**
```
+----+-------------+--------+------+-----------------+------+---------+-------+------+-------------+
| id | select_type | table  | type | possible_keys   | key  | key_len | ref   | rows | Extra       |
+----+-------------+--------+------+-----------------+------+---------+-------+------+-------------+
|  1 | SIMPLE      | orders | ref  | idx_orders_cust | idx  | 4       | const |  150 | Using where |
+----+-------------+--------+------+-----------------+------+---------+-------+------+-------------+
```

### EXPLAIN FORMAT=JSON

**Detailed JSON Output:**
```sql
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;
```

**Benefits:**
- More detailed information
- Nested structure for complex queries
- Cost estimates
- Filtering statistics

### Access Types (type column)

**Performance Ranking
(Best to Worst):**

1. **system** - Single-row table
2. **const** - Primary key/unique lookup with a constant
3. **eq_ref** - One row per previous-table row (unique index join)
4. **ref** - Non-unique index lookup
5. **range** - Index range scan (BETWEEN, >, <, IN)
6. **index** - Full index scan
7. **ALL** - Full table scan

**Target:** const, eq_ref, ref, or range.

### Extra Column Meanings

**Good:**
- **Using index** - Index-only scan (covering index)
- **Using index condition** - Index condition pushdown (ICP)

**Acceptable:**
- **Using where** - WHERE clause filtering after retrieval

**Warning:**
- **Using temporary** - Temporary table created
- **Using filesort** - Sorting required (not index-based)

**Bad:**
- **Using join buffer** - Join without an index (add one!)

## MySQL Index Hints

### USE INDEX

**Suggest Index:**
```sql
SELECT * FROM orders USE INDEX (idx_orders_customer)
WHERE customer_id = 123 AND created_at > '2025-01-01';
```

**When to Use:** The optimizer chooses the wrong index.

### FORCE INDEX

**Force Index Usage:**
```sql
SELECT * FROM orders FORCE INDEX (idx_orders_customer)
WHERE customer_id = 123;
```

**When to Use:** Must use a specific index (rare).

### IGNORE INDEX

**Prevent Index Usage:**
```sql
SELECT * FROM orders IGNORE INDEX (idx_orders_status)
WHERE customer_id = 123 AND status = 'pending';
```

**When to Use:** Force a full table scan or a different index.

### Optimizer Hints (MySQL 8.0+)

**JOIN Order Hints:**
```sql
SELECT /*+ JOIN_ORDER(orders, customers) */
  orders.*, customers.name
FROM orders
INNER JOIN customers ON orders.customer_id = customers.id;
```

**Index Hints:**
```sql
SELECT /*+ INDEX(orders idx_orders_customer) */ *
FROM orders WHERE customer_id = 123;
```

**Subquery Hints:**
```sql
-- Place the hint inside the subquery block it applies to
SELECT * FROM users
WHERE id IN (SELECT /*+ SUBQUERY(MATERIALIZATION) */ user_id
             FROM orders WHERE total > 1000);
```

## MySQL-Specific Optimizations

### Generated Columns (MySQL 5.7+)

**Use Case:** Index computed values.
**Virtual Generated Column:**
```sql
ALTER TABLE users
ADD COLUMN email_lower VARCHAR(255)
GENERATED ALWAYS AS (LOWER(email)) VIRTUAL;

-- Index the generated column
CREATE INDEX idx_users_email_lower ON users (email_lower);
```

**Stored Generated Column:**
```sql
ALTER TABLE users
ADD COLUMN full_name VARCHAR(510)
GENERATED ALWAYS AS (CONCAT(first_name, ' ', last_name)) STORED;

CREATE INDEX idx_users_full_name ON users (full_name);
```

**Difference:**
- **VIRTUAL**: computed on read (no storage overhead)
- **STORED**: computed on write (faster reads, storage overhead)

### JSON Indexes

**Create Generated Column for JSON Path:**
```sql
-- Extract JSON field
ALTER TABLE users
ADD COLUMN premium_status VARCHAR(10)
GENERATED ALWAYS AS (metadata->>'$.premium') STORED;

-- Index the generated column
CREATE INDEX idx_users_premium ON users (premium_status);
```

**Query:**
```sql
SELECT * FROM users WHERE premium_status = 'true';
```

### Index Condition Pushdown (ICP)

**MySQL 5.6+ Feature:** Push WHERE conditions down to the storage engine.

**Without ICP:**
- The storage engine returns all rows matching the first index column
- The MySQL server filters the remaining conditions

**With ICP:**
- The storage engine filters on all index columns
- Fewer rows are returned to the MySQL server

**Check if Enabled:**
```sql
SHOW VARIABLES LIKE 'optimizer_switch';
-- Look for index_condition_pushdown=on
```

**EXPLAIN Indicator:**
```
Extra: Using index condition
```

### Multi-Range Read (MRR)

**MySQL 5.6+ Feature:** Optimize range scans by sorting row IDs before fetching.

**Benefits:**
- Sequential I/O instead of random I/O
- Fewer page fetches

**Enable:**
```sql
SET optimizer_switch='mrr=on,mrr_cost_based=off';
```

## MySQL Configuration Tuning

### Buffer Pool Size (InnoDB)

**Recommendation:** 70-80% of system RAM on a dedicated database server.
```sql
-- Check current setting
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- Set in my.cnf / my.ini
[mysqld]
innodb_buffer_pool_size = 8G
```

### Query Cache (Deprecated)

**Warning:** The query cache was removed in MySQL 8.0.

**MySQL 5.7 and Earlier:**
```sql
-- Check query cache status
SHOW VARIABLES LIKE 'query_cache%';

-- Disable query cache (recommended for modern apps)
query_cache_type = 0
query_cache_size = 0
```

**Replacement:** Application-level caching (Redis, Memcached).

### Join Buffer Size

**Used for:** Joins without indexes.

```sql
-- Check current setting
SHOW VARIABLES LIKE 'join_buffer_size';

-- Set per session
SET SESSION join_buffer_size = 8388608; -- 8MB
```

### Sort Buffer Size

**Used for:** ORDER BY, GROUP BY operations.

```sql
-- Check current setting
SHOW VARIABLES LIKE 'sort_buffer_size';

-- Set per session
SET SESSION sort_buffer_size = 2097152; -- 2MB
```

## MySQL Monitoring

### Slow Query Log

**Enable:**
```sql
-- Enable slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.5; -- Log queries > 500ms
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow-query.log';
```

**Analyze Slow Queries:**
```bash
# mysqldumpslow - Parse the slow query log
mysqldumpslow -s t -t 10 /var/log/mysql/slow-query.log
# -s t: Sort by time
# -t 10: Top 10 queries
```

### Performance Schema

**Enable:**
```sql
-- Check if enabled
SHOW VARIABLES LIKE 'performance_schema';

-- Enable in my.cnf
[mysqld]
performance_schema = ON
```

**Query Statistics:**
```sql
-- Top 10 slowest queries
SELECT
  DIGEST_TEXT,
  COUNT_STAR,
  AVG_TIMER_WAIT / 1000000000 AS avg_ms,
  SUM_TIMER_WAIT / 1000000000 AS total_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

**Index Usage:**
```sql
-- Tables without a primary key
SELECT TABLE_SCHEMA, TABLE_NAME
FROM information_schema.TABLES
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema', 'performance_schema')
  AND ENGINE = 'InnoDB'
  AND TABLE_TYPE = 'BASE TABLE'
  AND
NOT EXISTS (
    SELECT 1 FROM information_schema.STATISTICS
    WHERE TABLE_SCHEMA = TABLES.TABLE_SCHEMA
      AND TABLE_NAME = TABLES.TABLE_NAME
      AND INDEX_NAME = 'PRIMARY'
  );
```

### Table Statistics

**Update Statistics:**
```sql
-- Analyze table
ANALYZE TABLE orders;

-- Optimize table (rebuilds, updates stats)
OPTIMIZE TABLE orders;
```

**View Statistics:**
```sql
-- Table sizes
SELECT
  TABLE_SCHEMA,
  TABLE_NAME,
  ROUND(DATA_LENGTH / 1024 / 1024, 2) AS data_mb,
  ROUND(INDEX_LENGTH / 1024 / 1024, 2) AS index_mb,
  ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024, 2) AS total_mb,
  TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
```

## MySQL Best Practices

### Primary Key Selection

**Recommended:**
```sql
-- Auto-increment integer (clustered-index friendly)
CREATE TABLE orders (
  id INT AUTO_INCREMENT PRIMARY KEY,
  ...
) ENGINE=InnoDB;
```

**Avoid:**
```sql
-- UUID as primary key (poor clustered index performance)
CREATE TABLE orders (
  id CHAR(36) PRIMARY KEY DEFAULT (UUID()), -- ❌ Fragmentation
  ...
) ENGINE=InnoDB;
```

**UUID Alternative (MySQL 8.0+):**
```sql
-- Use UUID_TO_BIN with byte reordering for better clustering
CREATE TABLE orders (
  id BINARY(16) PRIMARY KEY DEFAULT (UUID_TO_BIN(UUID(), 1)),
  ...
) ENGINE=InnoDB;
```

### Composite Index Order

**Rule:** Equality filters → range filters → ORDER BY columns.

```sql
-- Query pattern
SELECT * FROM orders
WHERE customer_id = 123
  AND status IN ('pending', 'processing')
  AND created_at > '2025-01-01'
ORDER BY created_at DESC;

-- Optimal index
CREATE INDEX idx_orders_customer_status_created
ON orders (customer_id, status, created_at);
```

### Avoid SELECT *

**Bad:**
```sql
SELECT * FROM users WHERE id = 123;
```

**Good:**
```sql
SELECT id, name, email FROM users WHERE id = 123;
```

### Use LIMIT for Large Result Sets

**Always limit:**
```sql
-- Good: limits results
SELECT * FROM orders ORDER BY created_at DESC LIMIT 100;

-- Bad: no limit on a large table
SELECT * FROM orders ORDER BY created_at DESC;
```

## Quick Reference

### Index Type Selection

| Use Case | Index Type | Creation |
|----------|-----------|----------|
| General queries | B-tree | `CREATE INDEX ON table (column)` |
| Long strings | B-tree prefix | `CREATE INDEX ON table (column(10))` |
| Full-text search | Full-text | `CREATE FULLTEXT INDEX ON table (column)` |
| Spatial data | Spatial | `CREATE SPATIAL INDEX ON table (column)` |

### Storage Engine Selection

| Requirement | Engine | Notes |
|------------|--------|-------|
| Transactions | InnoDB | Default, recommended |
| Read-only archive | MyISAM | Legacy, avoid |
| In-memory | MEMORY | Temporary tables only |

### Configuration Priorities

1. **innodb_buffer_pool_size**: 70-80% of RAM
2. **innodb_log_file_size**: 256MB-1GB
3. **max_connections**: based on workload (100-200 typical)
4. **innodb_flush_log_at_trx_commit**: 2 for better performance (1 for durability)
```

### references/sqlserver.md

```markdown
# SQL Server-Specific Optimizations

SQL Server-specific features, execution plans, and optimization techniques.
## SQL Server Index Types

### Clustered Index

**Characteristics:**
- Table data physically sorted by the index key
- One per table
- The primary key creates a clustered index by default

**Creation:**
```sql
-- Explicit clustered index
CREATE CLUSTERED INDEX IX_Orders_OrderDate ON Orders (OrderDate);

-- Primary key with clustered index (default)
CREATE TABLE Orders (
  OrderID INT PRIMARY KEY CLUSTERED,
  ...
);
```

**Benefits:**
- Fast range queries on the index key
- Fast ORDER BY on index columns

**Considerations:**
- Choose a narrow, unique, sequential key (avoid GUIDs)
- Use INT IDENTITY or BIGINT IDENTITY

### Non-Clustered Index

**Characteristics:**
- Separate structure from the table data
- Multiple non-clustered indexes per table
- Includes a pointer to the clustered index key

**Creation:**
```sql
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON Orders (CustomerID);
```

### Covering Index with INCLUDE

**Use Case:** Include non-indexed columns for Index-Only Scans.

**Syntax:**
```sql
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_Covering
ON Orders (CustomerID)
INCLUDE (OrderDate, TotalAmount, Status);
```

**Query Benefits:**
```sql
-- Uses the covering index (no Key Lookup)
SELECT OrderDate, TotalAmount, Status
FROM Orders
WHERE CustomerID = 123;
```

### Filtered Index

**Use Case:** Index a subset of rows (similar to a PostgreSQL partial index).

**Creation:**
```sql
-- Index only active orders
CREATE NONCLUSTERED INDEX IX_Orders_Active
ON Orders (OrderDate)
WHERE Status = 'Active';

-- Index only recent orders
CREATE NONCLUSTERED INDEX IX_Orders_Recent
ON Orders (OrderDate)
WHERE OrderDate >= '2025-01-01';
```

**Benefits:**
- Smaller index size
- Faster maintenance
- More efficient for common filtered queries

### Columnstore Index

**Use Case:** Data warehouse, analytics queries.
**Clustered Columnstore:**
```sql
-- Convert the entire table to columnstore
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sales ON Sales;
```

**Non-Clustered Columnstore:**
```sql
-- Add a columnstore index alongside the rowstore
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sales
ON Sales (ProductID, OrderDate, Quantity, Amount);
```

**Benefits:**
- ~10x compression
- 10-100x faster aggregation queries
- Batch mode execution

**When to Use:**
- Large fact tables (millions of rows)
- Read-heavy analytical queries
- Data warehouse scenarios

## SQL Server Execution Plans

### Accessing Execution Plans

**Estimated Execution Plan (Ctrl+L):**
- No query execution
- Estimated costs and row counts
- Quick analysis

**Actual Execution Plan (Ctrl+M, then execute):**
- The query executes
- Actual row counts and timing
- More accurate for optimization

### Reading Execution Plans

**Graphical Plan Direction:** Right to left, top to bottom.

**Operation Icons:**
- **Clustered Index Scan** - Full table scan (expensive)
- **Clustered Index Seek** - Efficient lookup (good)
- **Index Seek** - Non-clustered index lookup (good)
- **Index Scan** - Full index scan (less efficient)
- **Table Scan** - Full heap scan (worst)
- **Key Lookup** - Additional lookup for non-indexed columns
- **Nested Loops** - Join algorithm
- **Hash Match** - Join or aggregation

**Arrow Thickness:** Relative row count (thicker = more rows).
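Plans can also be captured from plain T-SQL, without the SSMS shortcuts, which is useful when scripting analysis. A sketch (the query is illustrative; `SET SHOWPLAN_XML` must be the only statement in its batch, hence the `GO` separators):

```sql
-- Estimated plan: compile only, do not execute
SET SHOWPLAN_XML ON;
GO
SELECT OrderDate, TotalAmount FROM Orders WHERE CustomerID = 123;
GO
SET SHOWPLAN_XML OFF;
GO

-- Actual plan: execute the query and return the runtime plan as XML
SET STATISTICS XML ON;
SELECT OrderDate, TotalAmount FROM Orders WHERE CustomerID = 123;
SET STATISTICS XML OFF;
```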
**Warnings (Yellow Exclamation Mark):**
- Missing index suggestion
- Implicit type conversion
- Excessive memory grant
- Spills to tempdb

### Missing Index Suggestions

**Execution Plan Recommendation:**
```xml
<MissingIndexes>
  <MissingIndexGroup Impact="95.5">
    <MissingIndex Database="YourDB" Schema="dbo" Table="Orders">
      <ColumnGroup Usage="EQUALITY">
        <Column Name="CustomerID" />
      </ColumnGroup>
      <ColumnGroup Usage="INCLUDE">
        <Column Name="OrderDate" />
        <Column Name="TotalAmount" />
      </ColumnGroup>
    </MissingIndex>
  </MissingIndexGroup>
</MissingIndexes>
```

**Create Suggested Index:**
```sql
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON Orders (CustomerID)
INCLUDE (OrderDate, TotalAmount);
```

## SQL Server Query Store

### Enable Query Store

**Database-Level Setting:**
```sql
ALTER DATABASE YourDatabase
SET QUERY_STORE = ON (
  OPERATION_MODE = READ_WRITE,
  DATA_FLUSH_INTERVAL_SECONDS = 900,
  INTERVAL_LENGTH_MINUTES = 60,
  MAX_STORAGE_SIZE_MB = 1000,
  QUERY_CAPTURE_MODE = AUTO
);
```

### Find Expensive Queries

**Top Queries by Duration:**
```sql
SELECT TOP 10
  q.query_id,
  qt.query_sql_text,
  rs.avg_duration / 1000.0 AS avg_duration_ms,
  rs.avg_logical_io_reads,
  rs.avg_physical_io_reads,
  rs.count_executions,
  rs.last_execution_time
FROM sys.query_store_query q
INNER JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
INNER JOIN sys.query_store_plan p ON q.query_id = p.query_id
INNER JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
WHERE rs.last_execution_time > DATEADD(DAY, -7, GETDATE())
ORDER BY rs.avg_duration DESC;
```

**Top Queries by CPU:**
```sql
SELECT TOP 10
  qt.query_sql_text,
  rs.avg_cpu_time / 1000.0 AS avg_cpu_ms,
  rs.count_executions,
  rs.avg_duration / 1000.0 AS avg_duration_ms
FROM sys.query_store_query q
INNER JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
INNER JOIN sys.query_store_plan p ON q.query_id = p.query_id
INNER JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
ORDER BY
rs.avg_cpu_time DESC;
```

### Query Performance Comparison

**Compare Plans Over Time:**
```sql
-- Track query performance regression
SELECT
  q.query_id,
  qt.query_sql_text,
  p.plan_id,
  rs.avg_duration / 1000.0 AS avg_duration_ms,
  rs.first_execution_time,
  rs.last_execution_time
FROM sys.query_store_query q
INNER JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
INNER JOIN sys.query_store_plan p ON q.query_id = p.query_id
INNER JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
WHERE q.query_id = 123 -- Specific query
ORDER BY rs.first_execution_time;
```

## SQL Server-Specific Optimizations

### Statistics Management

**Update Statistics:**
```sql
-- Update all statistics for a table
UPDATE STATISTICS Orders;

-- Update specific index statistics
UPDATE STATISTICS Orders IX_Orders_CustomerID;

-- Update with a full scan (more accurate)
UPDATE STATISTICS Orders WITH FULLSCAN;
```

**Auto-Update Statistics:**
```sql
-- Check the setting
SELECT name, is_auto_update_stats_on
FROM sys.databases
WHERE name = 'YourDatabase';

-- Enable auto-update
ALTER DATABASE YourDatabase SET AUTO_UPDATE_STATISTICS ON;
```

### Index Fragmentation

**Check Fragmentation:**
```sql
SELECT
  OBJECT_NAME(ips.object_id) AS TableName,
  i.name AS IndexName,
  ips.index_type_desc,
  ips.avg_fragmentation_in_percent,
  ips.page_count
FROM sys.dm_db_index_physical_stats(
  DB_ID(), NULL, NULL, NULL, 'LIMITED'
) ips
INNER JOIN sys.indexes i
  ON ips.object_id = i.object_id AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 10
  AND ips.page_count > 1000
ORDER BY ips.avg_fragmentation_in_percent DESC;
```

**Rebuild vs Reorganize:**
```sql
-- Rebuild index (offline, faster, more thorough)
ALTER INDEX IX_Orders_CustomerID ON Orders REBUILD;

-- Rebuild online (SQL Server Enterprise)
ALTER INDEX IX_Orders_CustomerID ON Orders REBUILD WITH (ONLINE = ON);

-- Reorganize index (online, slower, less thorough)
ALTER INDEX IX_Orders_CustomerID ON Orders REORGANIZE;
```

**Guidelines:**
- **Fragmentation <10%**: No action needed
- **Fragmentation 10-30%**: REORGANIZE
- **Fragmentation >30%**: REBUILD

### Partitioning

**Partition Function:**
```sql
-- Create partition function (monthly partitions)
CREATE PARTITION FUNCTION PF_Orders_Monthly (DATE)
AS RANGE RIGHT FOR VALUES
('2025-01-01', '2025-02-01', '2025-03-01', ..., '2025-12-01');
```

**Partition Scheme:**
```sql
-- Create partition scheme
CREATE PARTITION SCHEME PS_Orders_Monthly
AS PARTITION PF_Orders_Monthly ALL TO ([PRIMARY]);
```

**Partitioned Table:**
```sql
-- Create partitioned table
CREATE TABLE Orders (
  OrderID INT IDENTITY(1,1),
  OrderDate DATE,
  CustomerID INT,
  TotalAmount DECIMAL(10,2),
  ...
) ON PS_Orders_Monthly (OrderDate);
```

**Benefits:**
- Partition elimination (scan only the relevant partitions)
- Partition switching (fast archive/purge)
- Parallel query execution per partition

### Computed Columns

**Persisted Computed Column:**
```sql
ALTER TABLE Users
ADD FullName AS (FirstName + ' ' + LastName) PERSISTED;

-- Index the computed column
CREATE INDEX IX_Users_FullName ON Users (FullName);
```

**Non-Persisted Computed Column:**
```sql
ALTER TABLE Users
ADD EmailDomain AS (SUBSTRING(Email, CHARINDEX('@', Email) + 1, LEN(Email)));
-- Computed on read, no storage overhead
```

## SQL Server Configuration

### Max Degree of Parallelism (MAXDOP)

**Check Current Setting:**
```sql
SELECT value_in_use
FROM sys.configurations
WHERE name = 'max degree of parallelism';
```

**Set MAXDOP:**
```sql
-- Limit to 4 cores per query
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;
```

**Recommendation:**
- Small servers (<8 cores): MAXDOP = 4
- Large servers (>8 cores): MAXDOP = 8
- OLTP workloads: lower MAXDOP (2-4)
- Analytics workloads: higher MAXDOP (8-16)

### Cost Threshold for Parallelism

**Set Threshold:**
```sql
-- Only parallelize queries with an estimated cost > 50
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;
```

**Default:** 5 (too low for most systems)

**Recommended:**
50-100

### Memory Configuration

**Max Server Memory:**
```sql
-- Reserve memory for the OS and other apps
-- Example: 32GB server, allocate 28GB to SQL Server
EXEC sp_configure 'max server memory (MB)', 28672;
RECONFIGURE;
```

**Recommendation:**
- Leave 4GB for the OS on small servers
- Leave 10-20% for the OS on large servers

## SQL Server Monitoring

### DMVs for Query Performance

**Most Expensive Queries (CPU):**
```sql
SELECT TOP 10
  SUBSTRING(qt.text, (qs.statement_start_offset/2)+1,
    ((CASE qs.statement_end_offset
        WHEN -1 THEN DATALENGTH(qt.text)
        ELSE qs.statement_end_offset
      END - qs.statement_start_offset)/2)+1) AS query_text,
  qs.execution_count,
  qs.total_worker_time / 1000 AS total_cpu_ms,
  qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,
  qs.last_execution_time
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY qs.total_worker_time DESC;
```

**Missing Indexes:**
```sql
SELECT
  mid.database_id,
  mid.object_id,
  OBJECT_NAME(mid.object_id, mid.database_id) AS table_name,
  migs.avg_user_impact,
  migs.user_seeks,
  mid.equality_columns,
  mid.inequality_columns,
  mid.included_columns
FROM sys.dm_db_missing_index_details mid
INNER JOIN sys.dm_db_missing_index_groups mig
  ON mid.index_handle = mig.index_handle
INNER JOIN sys.dm_db_missing_index_group_stats migs
  ON mig.index_group_handle = migs.group_handle
WHERE migs.avg_user_impact > 50
ORDER BY migs.avg_user_impact DESC;
```

### Blocking and Waits

**Current Blocking:**
```sql
SELECT
  blocking.session_id AS blocking_session,
  blocked.session_id AS blocked_session,
  blocked_text.text AS blocked_query,
  blocking_text.text AS blocking_query
FROM sys.dm_exec_requests blocked
INNER JOIN sys.dm_exec_requests blocking
  ON blocked.blocking_session_id = blocking.session_id
CROSS APPLY sys.dm_exec_sql_text(blocked.sql_handle) blocked_text
CROSS APPLY sys.dm_exec_sql_text(blocking.sql_handle) blocking_text;
```

## Quick Reference

### Index Types

| Index Type | Use Case | Creation |
|-----------|----------|----------|
| Clustered | Primary key, range queries | `CREATE CLUSTERED INDEX` |
| Non-Clustered | Secondary lookups | `CREATE NONCLUSTERED INDEX` |
| Covering | Index-only scans | `CREATE INDEX ... INCLUDE (...)` |
| Filtered | Partial index | `CREATE INDEX ... WHERE ...` |
| Columnstore | Analytics, DW | `CREATE COLUMNSTORE INDEX` |

### Execution Plan Operations

| Operation | Performance | Action |
|-----------|-------------|--------|
| Clustered Index Seek | Excellent | Keep |
| Index Seek | Excellent | Keep |
| Index Scan | Fair | Consider a covering index |
| Clustered Index Scan | Poor | Add an index (acceptable for small tables) |
| Table Scan | Worst | Add an index |
| Key Lookup | Moderate | Consider a covering index |

### Configuration Priorities

1. **Max Server Memory**: leave 4GB+ for the OS
2. **MAXDOP**: 4-8 for most workloads
3. **Cost Threshold for Parallelism**: 50-100
4. **Query Store**: enable for all databases
```