Databases don’t actually "know" what data you want; they just have a really good bookkeeping system for finding it, and that system is called an index.

Let’s watch some indexes in action. Imagine a table of customers with millions of rows, each having an id, name, and signup_date.

CREATE TABLE customers (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    signup_date DATE
);

-- Populate with millions of rows...

Without an index on name, a query like SELECT * FROM customers WHERE name = 'Alice'; would have to scan every single row in the table. This is a full table scan.

Now, let’s add an index. This is like creating an alphabetical index at the back of a book.

CREATE INDEX idx_customers_name ON customers (name);

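With idx_customers_name, the database can now jump directly to the entries where name is 'Alice'. Instead of comparing against millions of rows, it narrows the search in a logarithmic number of steps, on the order of a few dozen comparisons for a table this size. In PostgreSQL's EXPLAIN output, the plan typically changes from Seq Scan to Index Scan (or to a Bitmap Index Scan when many rows match).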

-- To see the plan:
EXPLAIN ANALYZE SELECT * FROM customers WHERE name = 'Alice';

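EXPLAIN ANALYZE actually runs the query and reports the chosen plan along with real timings. Run it before and after creating the index: on a table this size, the execution time usually drops sharply, often by orders of magnitude.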
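To try the before-and-after comparison without a database server, here is a minimal sketch using Python's built-in sqlite3 module. Note the swap: it uses SQLite's EXPLAIN QUERY PLAN rather than PostgreSQL's EXPLAIN ANALYZE, so the plan wording differs (SQLite reports SCAN and SEARCH instead of Seq Scan and Index Scan). The row count, the generated names, and the query_plan helper are assumptions made up for illustration.

```python
import sqlite3


def query_plan(conn, sql):
    """Return SQLite's plan for `sql`, joining the 'detail' text of each step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the description


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name VARCHAR(255), signup_date DATE)"
)

# Illustrative data only: 10,000 generated rows plus one 'Alice'.
conn.executemany(
    "INSERT INTO customers (id, name, signup_date) VALUES (?, ?, ?)",
    [(i, f"user{i}", "2024-01-01") for i in range(10_000)],
)
conn.execute(
    "INSERT INTO customers (id, name, signup_date) VALUES (?, ?, ?)",
    (10_000, "Alice", "2024-01-02"),
)

query = "SELECT * FROM customers WHERE name = 'Alice'"

plan_before = query_plan(conn, query)  # no index on name yet: full scan
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")
plan_after = query_plan(conn, query)   # planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan text varies between SQLite versions, but the first plan should describe a scan of customers and the second should mention idx_customers_name.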

The core problem indexes solve is reducing the number of data pages the database has to read from disk (or memory). Disk I/O is usually the slowest part of a database operation. An index is a separate data structure, most commonly a B-tree, that stores the indexed column values in sorted order along with pointers to the actual rows. When you query an indexed column, the database traverses the B-tree to find the relevant pointers, then fetches only those specific rows.
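The lookup idea can be sketched in a few lines of Python. This is not a real B-tree: a sorted list searched with the standard bisect module stands in for it, which keeps the same "sorted values plus row pointers" shape while ignoring pages and tree levels. The sample rows and the lookup function are hypothetical.

```python
import bisect

# The "table": rows stored in arrival order, not sorted by name.
rows = [
    {"id": 1, "name": "Carol"},
    {"id": 2, "name": "Alice"},
    {"id": 3, "name": "Bob"},
    {"id": 4, "name": "Alice"},
]

# The "index": (value, row position) pairs kept in sorted order.
name_index = sorted((row["name"], pos) for pos, row in enumerate(rows))


def lookup(index, table, value):
    """Binary-search the index for `value`, then fetch only the matching rows."""
    # (value,) sorts just before any (value, pos) pair, so this lands on
    # the first entry for `value` if one exists.
    i = bisect.bisect_left(index, (value,))
    matches = []
    while i < len(index) and index[i][0] == value:
        matches.append(table[index[i][1]])  # follow the pointer to the row
        i += 1
    return matches


print(lookup(name_index, rows, "Alice"))
```

Only the entries for 'Alice' are touched after the binary search; the rows for Bob and Carol are never read.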

The exact levers you control are:

  1. Which columns to index: Index columns frequently used in WHERE clauses, JOIN conditions, and ORDER BY clauses.
  2. Composite indexes: Indexing multiple columns together (e.g., INDEX ON (col_a, col_b)) is crucial when queries filter on both col_a and col_b simultaneously. The order matters: (col_a, col_b) is efficient for WHERE col_a = X AND col_b = Y and WHERE col_a = X, but not for WHERE col_b = Y.
  3. Index selectivity: An index is more effective if the values in the indexed column are unique or nearly unique. Indexing a boolean is_active column with 99% true values is rarely helpful on its own.
  4. Index types: Beyond B-trees, there are specialized indexes like GIN or GiST for full-text search, geospatial data, or JSONB data, each optimized for different query patterns.
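Selectivity from the list above is easy to estimate by hand. The sketch below uses one common definition (distinct values divided by total rows) and also computes how many rows a single value matches; both function names and the sample data are assumptions for illustration, not something a database exposes under these names.

```python
def selectivity(values):
    """Distinct values / total rows. 1.0 means every value is unique."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def matching_fraction(values, wanted):
    """Fraction of rows an equality filter on `wanted` would return."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if v == wanted) / len(values)


ids = list(range(100))              # unique per row, like a primary key
is_active = [True] * 99 + [False]   # 99% true, as in the example above

print(selectivity(ids))                    # every value distinct
print(selectivity(is_active))              # only two distinct values
print(matching_fraction(is_active, True))  # filter keeps almost every row
```

A filter that keeps 99% of the rows gains little from an index, because the database still has to fetch nearly the whole table.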

A common misconception is that more indexes are always better. Each index adds overhead. Inserts, updates, and deletes become slower because the database must update every relevant index. Indexes also consume disk space. The art is in finding the smallest set of indexes that covers your most critical query workloads.
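The write overhead can be made concrete with a toy model. The ToyTable class below is entirely hypothetical: it keeps one sorted list per indexed column and counts how many index entries each insert has to write. Real databases do far more work per index update, so treat the counts as a lower bound on the idea, not a measurement.

```python
import bisect


class ToyTable:
    """A list of rows plus one sorted (value, row position) list per index."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: [] for col in indexed_columns}
        self.index_writes = 0  # how many index entries we have had to write

    def insert(self, row):
        pos = len(self.rows)
        self.rows.append(row)
        # Every index must be updated on every insert.
        for col, entries in self.indexes.items():
            bisect.insort(entries, (row[col], pos))
            self.index_writes += 1


sample = [
    {"id": i, "name": f"user{i}", "signup_date": "2024-01-01"}
    for i in range(5)
]

no_indexes = ToyTable([])
three_indexes = ToyTable(["id", "name", "signup_date"])
for row in sample:
    no_indexes.insert(row)
    three_indexes.insert(row)

print(no_indexes.index_writes, three_indexes.index_writes)
```

Five inserts cost no extra writes on the unindexed table and fifteen on the table with three indexes: the cost grows with rows times indexes.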

When you create an index on (col_a, col_b), the database builds a B-tree where the primary sort key is col_a, and entries with the same col_a are sorted by col_b. This means the index can efficiently answer queries filtering on col_a alone, or on both col_a and col_b together. For a query filtering on col_b alone, the matching entries are scattered throughout the index, so the best the planner can usually do is a "full index scan" that reads every entry and filters afterward. That can still beat a table scan when the index is much smaller than the table, but for direct lookups the leading column is king.
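The effect of column order shows up clearly if you model the composite index as a sorted list of tuples. As before, this is a sketch with a sorted list standing in for the B-tree; the sample entries, the row ids, and the three find_* functions are made up for illustration.

```python
import bisect

# Composite index on (col_a, col_b): sorted by col_a, then col_b.
# Each entry is (col_a, col_b, row_id).
index = sorted([
    ("east", 2, 101),
    ("west", 1, 102),
    ("east", 1, 103),
    ("north", 2, 104),
    ("west", 2, 105),
])

# Precomputed key views so bisect can search on a prefix of each entry.
a_keys = [a for a, _, _ in index]
ab_keys = [(a, b) for a, b, _ in index]


def find_by_a(a):
    """Leading column only: matching entries form one contiguous range."""
    lo = bisect.bisect_left(a_keys, a)
    hi = bisect.bisect_right(a_keys, a)
    return [row_id for _, _, row_id in index[lo:hi]]


def find_by_a_and_b(a, b):
    """Both columns: still one contiguous range, just a narrower one."""
    lo = bisect.bisect_left(ab_keys, (a, b))
    hi = bisect.bisect_right(ab_keys, (a, b))
    return [row_id for _, _, row_id in index[lo:hi]]


def find_by_b(b):
    """Second column only: matches are scattered, so check every entry."""
    return [row_id for _, col_b, row_id in index if col_b == b]


print(find_by_a("east"))
print(find_by_a_and_b("west", 2))
print(find_by_b(2))
```

The first two functions binary-search to a single range, while find_by_b has no choice but to walk the whole list, which is the toy equivalent of a full index scan.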

The next step is understanding how to tune queries that aren’t using your indexes effectively, often due to data type mismatches or complex expression evaluation.
