Database Indexing Strategies for High-Performance Systems
You know that feeling when you're searching for a specific book in a library with thousands of volumes, and there's no catalog system? That's basically what your database experiences every time you run a query without proper indexing. Except instead of thousands of books, we're talking millions or billions of rows, and instead of minutes, we're burning milliseconds that add up into frustrated users and skyrocketing infrastructure costs.
Let's talk about indexing. Not the theoretical computer science version you might remember from university, but the real-world, battle-tested strategies that separate systems that crumble under load from those that scale gracefully.
The Foundation: What Indexes Actually Do
Before we get into the advanced stuff, let's make sure we're on the same page about what's happening under the hood. An index is a data structure that trades write performance and storage space for dramatically improved read performance. When you create an index, the database maintains an additional structure (usually a B-tree or hash table) that allows it to locate rows without scanning the entire table.
The classic analogy is a book's index, but I think that undersells what's happening. A better mental model is thinking about indexes as pre-computed answers to specific questions. Every index you create is basically telling the database: "I'm going to ask questions shaped like this frequently enough that it's worth maintaining this lookup structure."
The cost? Every write operation now has to update not just the table, but every index on that table. This is why the "just index everything" approach is a rookie mistake that I've seen bring down more than one production system.
B-Tree Indexes: The Workhorse
B-tree indexes (and their variant B+tree, which most modern databases actually use) are the default for good reason. They handle range queries beautifully, maintain sorted order, and provide O(log n) lookup time. PostgreSQL, MySQL, Oracle... they all default to B-tree variants.
Here's what makes B-trees special: they're balanced trees where each node can have multiple children. In a B+tree specifically, all the actual data pointers live in the leaf nodes, which are linked together. This means range scans are incredibly efficient. Once you find the start of your range, you just walk the linked list of leaf nodes.
CREATE INDEX idx_users_created_at ON users(created_at);
This simple index makes queries like WHERE created_at > '2024-01-01' lightning fast. But here's where it gets interesting: B-tree indexes are also useful for sorting. If you have ORDER BY created_at, the database can use this index to retrieve rows in sorted order without a separate sort operation.
The Left-Prefix Rule
Composite B-tree indexes have a quirk that trips up a lot of developers. The index is sorted by the first column, then the second, then the third. Just like a phone book is sorted by last name, then first name. This means an index on (last_name, first_name, age) can satisfy queries filtering on:
last_namelast_name, first_namelast_name, first_name, age
But it can't effectively help with queries filtering only on first_name or age. The column order matters a lot. When designing composite indexes, put your most selective columns first, unless you have specific query patterns that require otherwise.
Hash Indexes: When You Know Exactly What You Want
Hash indexes are the specialists in the index world. They're great at equality comparisons but completely useless for range queries. The database applies a hash function to the indexed column(s) and stores the result in a hash table.
CREATE INDEX idx_users_email_hash ON users USING HASH (email);
This shines for queries like WHERE email = 'user@example.com' and gives you O(1) lookup time. That's theoretically faster than B-tree's O(log n). But try WHERE email > 'a%' and the optimizer won't touch that hash index.
The gotcha? Until relatively recently, hash indexes in PostgreSQL weren't WAL-logged, meaning they weren't crash-safe. PostgreSQL 10 fixed this, but it's a good reminder to know your database version's quirks. MySQL's hash indexes only exist in MEMORY tables, which limits their practical use.
In practice, I reach for hash indexes rarely. Usually only when I have a high-volume lookup table where I'm exclusively doing equality checks and I've profiled to confirm B-tree isn't sufficient.
Covering Indexes: The Performance Multiplier
Here's a technique that can transform query performance: covering indexes. The idea is simple but powerful. Include all the columns your query needs directly in the index itself, so the database never has to touch the actual table.
CREATE INDEX idx_users_lookup ON users(email) INCLUDE (first_name, last_name, status);
Now a query like:
SELECT first_name, last_name, status FROM users WHERE email = 'user@example.com';
Can be satisfied entirely from the index. This is called an "index-only scan" and it's dramatically faster because:
- Index data is typically much smaller than full row data
- Indexes are stored separately and often cached more effectively
- You avoid the random I/O of looking up the actual row
The trade-off is obvious: larger indexes, slower writes. But for read-heavy workloads with specific query patterns, covering indexes are absolute gold.
Partial Indexes: Indexing What Matters
Why index rows you never query? Partial indexes let you create an index on only a subset of rows that match a specific condition.
CREATE INDEX idx_active_users_email ON users(email) WHERE status = 'active';
If 95% of your queries only care about active users, this index is smaller, faster to maintain, and more cache-friendly than indexing all users. I've seen partial indexes reduce index size by 80% in systems with soft deletes where deleted_at IS NULL is in almost every query.
The mental shift here is important: you're not just indexing columns, you're indexing specific access patterns. This is where understanding your actual query workload becomes critical.
Expression Indexes: When Data Isn't Stored How You Query It
Sometimes you need to query on a transformed version of your data. Expression indexes let you index the result of a function or expression rather than raw column values.
CREATE INDEX idx_users_lower_email ON users(LOWER(email));
Now case-insensitive email lookups are fast:
SELECT * FROM users WHERE LOWER(email) = 'user@example.com';
Without this index, even if you have a regular index on email, the database can't use it because you're querying LOWER(email), not email itself.
Common use cases include:
- Case-insensitive searches
- Date truncation (
DATE_TRUNC('day', created_at)) - JSON field extraction (
data->>'field_name') - Computed values (
price * quantity)
The caveat: your queries must use the exact same expression for the optimizer to recognize it can use the index.
Multi-Column Indexes vs Multiple Single-Column Indexes
This is a common point of confusion. Should you create one index on (country, city) or two separate indexes on country and city?
Modern databases can combine multiple single-column indexes using "bitmap index scans" or similar techniques, but there's nuance here:
Use a multi-column index when:
- You frequently query those columns together
- You need to support queries on the left-prefix columns
- You want optimal performance for a specific query pattern
Use separate single-column indexes when:
- Columns are queried independently in different queries
- You need maximum flexibility
- Your queries vary significantly in their filter combinations
In practice, I often start with separate indexes and create multi-column indexes when profiling reveals specific hot paths that would benefit.
GIN and GiST Indexes: For Complex Data Types
PostgreSQL's GIN (Generalized Inverted Index) and GiST (Generalized Search Tree) indexes handle data types that don't fit neatly into B-tree structures.
GIN Indexes
Perfect for:
- Full-text search
- JSONB queries
- Array containment (
WHERE tags @> ARRAY['postgresql'])
CREATE INDEX idx_posts_tags ON posts USING GIN(tags);
CREATE INDEX idx_documents_content ON documents USING GIN(to_tsvector('english', content));
GIN indexes are inverted indexes. They map each element (word, array value, JSON key) to the rows containing it. They're larger and slower to update than B-trees, but for the problems they solve, there's no alternative.
GiST Indexes
Ideal for:
- Geometric data
- Range types
- Full-text search (though GIN is usually better)
- Nearest-neighbor searches
CREATE INDEX idx_locations_point ON locations USING GiST(coordinates);
The distinguishing characteristic of GiST is that it's "lossy." It can have false positives that need to be rechecked against the actual data. This makes it space-efficient for certain data types where exact matching isn't possible.
Clustered vs Non-Clustered Indexes
This distinction is important, especially if you're working across different database systems.
Clustered Index: The table data is physically stored in the order of the index. There can only be one clustered index per table because the data can only be physically sorted one way. In MySQL's InnoDB, the primary key is the clustered index. In PostgreSQL, you can manually cluster a table once, but it doesn't stay clustered as data changes.
Non-Clustered Index: A separate structure that points back to the table data. You can have many of these.
The implication: range scans on a clustered index are incredibly fast because the data is physically next to each other on disk. Range scans on non-clustered indexes involve potentially random I/O as you jump around the table.
This is why choosing your primary key matters in MySQL. An auto-incrementing integer often makes sense because inserts append to the end of the table rather than causing page splits throughout.
Index Maintenance: The Unsexy But Critical Part
Indexes get worse over time. In PostgreSQL, this is called "bloat." In MySQL, it's fragmentation. As data is inserted, updated, and deleted, indexes become less efficient.
Monitoring index health:
-- PostgreSQL: check bloat
SELECT schemaname, tablename, indexname,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes;
-- MySQL: check fragmentation
SHOW TABLE STATUS WHERE Data_free > 0;
Rebuilding indexes:
-- PostgreSQL
REINDEX INDEX CONCURRENTLY idx_users_email;
-- MySQL
OPTIMIZE TABLE users;
The CONCURRENTLY keyword in PostgreSQL is critical for production systems. It allows the reindex to happen without locking out writes, though it takes longer and requires more disk space temporarily.
Set up monitoring for index bloat and schedule regular maintenance. I've seen query performance drop 10x simply because indexes hadn't been maintained in months.
Real-World Indexing Strategy
Here's how I approach indexing in production systems:
-
Start with the obvious: Primary keys, foreign keys, and columns in frequent WHERE clauses.
-
Profile before optimizing: Use
EXPLAIN ANALYZEreligiously. Don't guess about what needs indexing. Measure it. -
Monitor your slow query log: This tells you exactly what's hurting in production.
-
Consider the write/read ratio: Heavy write loads need fewer indexes. Analytics databases can be index-heavy.
-
Test with production-like data volumes: An index that helps with 1000 rows might hurt with 100 million rows.
-
Watch your index size: If your indexes are larger than your table, something's probably wrong.
The Index Anti-Patterns
Let me save you from mistakes I've made:
Over-indexing: Every index slows down writes. I've seen tables with 15+ indexes that became write bottlenecks. Be ruthless about dropping unused indexes.
Ignoring cardinality: Indexing a boolean column (true/false) rarely helps. The database still scans half the table. Index columns with high selectivity.
Not maintaining indexes: Set it and forget it doesn't work. Indexes need care and feeding.
Indexing for one-off queries: Don't index for that report that runs once a month at 3 AM. Just let it be slow or run it off a replica.
Forgetting about composite index column order: This trips up even experienced developers. The leftmost columns must be selective and commonly queried.
Wrapping Up
Indexing is one of those topics where theory meets practice in brutal ways. You can understand B-trees perfectly and still create terrible indexes if you don't understand your access patterns. On the flip side, you can create incredibly effective indexes with just a basic understanding if you profile well and iterate.
The databases we work with are sophisticated pieces of software with decades of optimization built in. Trust the query planner, but verify. Use EXPLAIN. Monitor your metrics. And remember: the best index is the one that's actually used.