Database Indexing: Beyond the Basics

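The most surprising thing about database indexing is that sometimes the fastest way to query your data is without an index: on a small table, or when a query matches a large share of the rows, a sequential scan can beat an index lookup.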

Let’s see how this plays out. Imagine you have a table users with millions of rows, and you frequently query for users by their email address.

-- Without an index, this can scan the entire table
SELECT * FROM users WHERE email = 'alice@example.com';

If you add a standard B-tree index on email, queries like the one above become lightning fast. The database can quickly navigate the B-tree to find the exact row(s) matching 'alice@example.com'.

-- Create a B-tree index
CREATE INDEX idx_users_email ON users (email);

-- Now this query is very fast
SELECT * FROM users WHERE email = 'alice@example.com';
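
To watch the planner change its mind, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for the database here purely because it runs anywhere; the table contents and the user42 address are made up for illustration, and the plan wording differs from what PostgreSQL's EXPLAIN would print.

```python
import sqlite3

# In-memory SQLite database standing in for the article's users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],  # hypothetical data
)

QUERY = "SELECT * FROM users WHERE email = 'user42@example.com'"


def plan(sql):
    """Return the plan descriptions SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # last column is the readable detail


plan_before = plan(QUERY)  # no index yet: expect a full table scan
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(QUERY)   # index available: expect an index search

print("before:", plan_before)
print("after: ", plan_after)
```

The exact text of each plan step depends on the SQLite version, but the first plan should mention a scan and the second should name idx_users_email.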

B-trees are the workhorse of database indexing. They’re balanced tree structures, meaning every path from the root to a leaf node has the same length. This gives predictable, logarithmic performance for equality lookups (=), range queries (>, <, BETWEEN), and prefix matching (LIKE 'abc%'). In PostgreSQL, prefix matching can only use a B-tree when the database uses the C locale or the index is built with an operator class such as text_pattern_ops. B-trees suit most data types and query patterns.
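
Why does one structure serve all three query shapes? Because the keys are kept in sorted order. The sketch below is not a B-tree; it uses a sorted Python list and the standard bisect module as a stand-in for the ordered leaf level, and the function names and sample emails are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# A sorted list stands in for the ordered leaf level of a B-tree.
emails = sorted([
    "alice@example.com", "bob@example.com", "carol@example.com",
    "dave@example.com", "erin@example.com",
])


def btree_equals(keys, target):
    """Equality lookup (=): binary search for a single key."""
    i = bisect_left(keys, target)
    return i < len(keys) and keys[i] == target


def btree_between(keys, low, high):
    """Range lookup (BETWEEN): find both ends, return the slice between."""
    return keys[bisect_left(keys, low):bisect_right(keys, high)]


def btree_prefix(keys, prefix):
    """Prefix lookup (LIKE 'abc%'): seek to the prefix, walk forward."""
    start = bisect_left(keys, prefix)
    end = start
    while end < len(keys) and keys[end].startswith(prefix):
        end += 1
    return keys[start:end]


print(btree_equals(emails, "carol@example.com"))
print(btree_between(emails, "b", "d"))
print(btree_prefix(emails, "da"))
```

All three operations start with a binary search, then read neighbouring keys in order. That is the property a hash index gives up.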

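However, B-trees aren’t always the answer. Consider a table where you frequently search for articles containing specific keywords. A B-tree on a text column indexes each value as a whole, so it can find text that equals or starts with a given string, but it cannot locate a word in the middle of a document; a query like LIKE '%database%' falls back to scanning every row. In PostgreSQL, very long values can also exceed the maximum size of a B-tree entry. This is where GIN (Generalized Inverted Index) comes in.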

A GIN index is like the index at the back of a book. Instead of one entry per row, it stores each distinct value (or "term") once, along with a list of the rows that contain it. For full-text search, the text is first broken down into normalized words (tokens).

-- Assume 'articles' table with a 'content' column
CREATE INDEX idx_articles_content_gin ON articles USING GIN (to_tsvector('english', content));

-- Query for articles containing 'database' AND 'indexing'
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('english', 'database & indexing');

GIN indexes are well suited to searching within arrays, JSON documents (jsonb in PostgreSQL), and full-text search. They excel at finding rows that match any, or all, of several terms. The trade-off is that GIN indexes can be large and are slower to build and update than B-trees: a single row can contain many terms, so one insert or update may touch many places in the index.
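
The inverted-index idea is easy to model in a few lines. This is a toy sketch, not how PostgreSQL implements GIN: it tokenizes by splitting on whitespace (no stemming or stop words), and the sample documents and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical article bodies keyed by row id.
docs = {
    1: "database indexing basics",
    2: "indexing strategies for search",
    3: "database tuning guide",
}


def build_inverted_index(documents):
    """Map each term to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def match_all(index, terms):
    """Rows containing every term (like 'a & b'): intersect the lists."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def match_any(index, terms):
    """Rows containing at least one term (like 'a | b'): union the lists."""
    return set().union(*(index.get(term, set()) for term in terms))


inverted = build_inverted_index(docs)
print(match_all(inverted, ["database", "indexing"]))
print(match_any(inverted, ["database", "indexing"]))
```

Row 1 alone contributed three separate entries to the index, one per word. That fan-out is why a single insert costs more here than in a B-tree.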

Now, let’s look at Hash indexes. These are simpler. They compute a hash value for each indexed column value and store it in a hash table. Hash indexes are extremely fast for exact equality lookups (=).

-- Create a Hash index
CREATE INDEX idx_products_sku_hash ON products USING HASH (sku);

-- This query is very fast
SELECT * FROM products WHERE sku = 'XYZ789';

The problem? Hash indexes cannot handle range queries or sorting, because hashing discards the ordering of the keys. If you try WHERE sku > 'XYZ789', the database has to ignore the hash index and scan the table (or fall back to another index). Because of this limitation, most databases make B-trees the default. PostgreSQL does support hash indexes, and they have been crash-safe since version 10, but in practice they rarely beat a B-tree by enough to justify giving up range queries.
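
A Python dict is itself a hash table, so it makes a convenient stand-in for seeing both sides of this trade-off. The product rows and function names below are hypothetical, and the sketch ignores details such as bucket pages and collisions.

```python
# Hypothetical product rows: (sku, name).
rows = [
    ("ABC123", "widget"),
    ("XYZ789", "gadget"),
    ("XYZ800", "gizmo"),
    ("LMN456", "doohickey"),
]

# The "hash index": sku -> list of positions in rows.
hash_index = {}
for row_id, (sku, _name) in enumerate(rows):
    hash_index.setdefault(sku, []).append(row_id)


def hash_equals(sku):
    """Equality lookup: one hash probe, then fetch the matching rows."""
    return [rows[i] for i in hash_index.get(sku, [])]


def greater_than(low):
    """Range query: hash order says nothing about key order,
    so every row has to be examined."""
    return sorted(row for row in rows if row[0] > low)


print(hash_equals("XYZ789"))
print(greater_than("XYZ789"))
```

The equality lookup touches a single entry no matter how many rows exist, while the range query's cost grows with the table.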

Finally, there are BRIN (Block Range INdex) indexes. These are a space-saving option for large tables where a column’s values are naturally correlated with physical storage order. A BRIN index stores only the minimum and maximum values of the column for each range of consecutive table blocks (128 pages per range by default in PostgreSQL, adjustable with the pages_per_range storage parameter).

Imagine a table of sensor readings ordered by timestamp. A BRIN index on the timestamp column would store the earliest and latest timestamp for each chunk of data on disk.

-- Assume 'sensor_readings' table with a 'reading_time' column, naturally ordered
CREATE INDEX idx_sensor_readings_time_brin ON sensor_readings USING BRIN (reading_time);

-- Query for readings within a specific hour
SELECT * FROM sensor_readings WHERE reading_time BETWEEN '2023-10-27 10:00:00' AND '2023-10-27 11:00:00';

If the reading_time range of a block range falls entirely outside your query’s range, the database can skip reading those blocks. Because it keeps only one small summary per block range, a BRIN index is tiny and fast to build, but it only works well when the indexed column’s values are correlated with their physical storage location. If values are scattered randomly across the table, nearly every block range’s minimum and maximum will straddle the query range, so the database reads almost the whole table anyway and the index adds overhead for no benefit.
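
The effect of physical ordering is easy to see in a toy model. The sketch below is a simplification and not PostgreSQL's implementation: it treats a list as the table, uses a hypothetical block size of four rows, and uses small integers in place of timestamps.

```python
BLOCK_SIZE = 4  # rows per block range; tiny, hypothetical value


def build_brin(values, block_size=BLOCK_SIZE):
    """Summarize each block range as (start_offset, min, max)."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append((start, min(block), max(block)))
    return summaries


def brin_scan(values, summaries, low, high, block_size=BLOCK_SIZE):
    """Return (sorted matches, number of block ranges actually read)."""
    matches, blocks_read = [], 0
    for start, block_min, block_max in summaries:
        if block_max < low or block_min > high:
            continue  # summary proves no row here can match: skip it
        blocks_read += 1
        block = values[start:start + block_size]
        matches.extend(v for v in block if low <= v <= high)
    return sorted(matches), blocks_read


# The same 16 values, stored in order and stored scattered.
ordered = list(range(100, 116))
scattered = [107, 112, 100, 115, 103, 109, 101, 114,
             105, 110, 102, 113, 104, 111, 106, 108]

print(brin_scan(ordered, build_brin(ordered), 104, 106))
print(brin_scan(scattered, build_brin(scattered), 104, 106))
```

Both scans return the same three values, but the ordered layout reads one block range while the scattered layout reads all four, because every scattered block's min/max spans the query.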

One crucial detail about GIN indexes in PostgreSQL is the fastupdate storage parameter, which is enabled by default. With fastupdate on, new entries are not merged into the main index structure immediately. They are appended to a pending list, which is merged in bulk later, for example by VACUUM or when the list grows past gin_pending_list_limit. This batching improves write throughput considerably, at the cost of slower reads (every search must also scan the unsorted pending list) and an occasional slow insert when a flush is triggered. If predictable read latency matters more than write speed, you can disable it with WITH (fastupdate = off) at index creation, or later with ALTER INDEX idx_articles_content_gin SET (fastupdate = off).
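
The batching idea behind fastupdate can be modelled with a pending list in front of an inverted index. This is a toy sketch under loud assumptions: the ToyGin class, its method names, and a limit counted in entries are all invented for illustration. PostgreSQL measures its pending list by size via gin_pending_list_limit, and its internals are far more involved.

```python
class ToyGin:
    """Inverted index with an optional pending list (toy model)."""

    def __init__(self, fastupdate=True, pending_limit=3):
        self.fastupdate = fastupdate
        self.pending_limit = pending_limit  # hypothetical entry-count limit
        self.main = {}     # term -> set of row ids (the merged index)
        self.pending = []  # unmerged (row_id, terms) entries

    def _merge(self, row_id, terms):
        """Add one row's terms to the main index structure."""
        for term in terms:
            self.main.setdefault(term, set()).add(row_id)

    def flush(self):
        """Merge every pending entry into the main index in one batch."""
        for row_id, terms in self.pending:
            self._merge(row_id, terms)
        self.pending.clear()

    def insert(self, row_id, text):
        terms = set(text.lower().split())
        if self.fastupdate:
            self.pending.append((row_id, terms))  # cheap append
            if len(self.pending) > self.pending_limit:
                self.flush()  # the occasional expensive insert
        else:
            self._merge(row_id, terms)  # pay the full cost right away

    def search(self, term):
        """Look in the main index, then scan the whole pending list."""
        hits = set(self.main.get(term, set()))
        for row_id, terms in self.pending:
            if term in terms:
                hits.add(row_id)
        return hits


batched = ToyGin(fastupdate=True)
batched.insert(1, "database indexing")
batched.insert(2, "database tuning")
print(len(batched.pending), batched.search("database"))
```

Until a flush happens, the main structure stays untouched and every search pays for a linear pass over the pending entries. That is the read-side cost of faster writes.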

The next frontier you’ll explore is how to combine these indexing strategies, perhaps using partial indexes or expression indexes, to tailor your database’s performance even further.