Databases don’t actually "look up" data like a phone book; they scan.
Let’s say you have a users table with millions of rows and you want to find a specific user by their email address.
SELECT * FROM users WHERE email = 'alice@example.com';
Without an index on the email column, the database has to read every single row in the users table, check if the email matches 'alice@example.com', and only then return the matching row. This is a full table scan, and it’s incredibly slow for large tables.
Here’s a typical scenario:
-- Sample users table
CREATE TABLE users (
id INT PRIMARY KEY AUTO_INCREMENT,
username VARCHAR(50) NOT NULL,
email VARCHAR(100) NOT NULL,
signup_date DATE
);
-- Insert a million dummy users
INSERT INTO users (username, email, signup_date)
SELECT
CONCAT('user_', n),
CONCAT('user_', n, '@example.com'),
DATE_SUB(CURDATE(), INTERVAL FLOOR(RAND() * 3650) DAY)
FROM (
SELECT a.N + b.N * 10 + c.N * 100 + d.N * 1000 + e.N * 10000 + f.N * 100000 AS N
FROM (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) a
CROSS JOIN (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) b
CROSS JOIN (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) c
CROSS JOIN (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) d
CROSS JOIN (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) e
CROSS JOIN (SELECT 0 AS N UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) f
) AS numbers
LIMIT 1000000;
Now, let’s run a query without an index and see how long it takes. We’ll use EXPLAIN to see the execution plan.
EXPLAIN SELECT * FROM users WHERE email = 'user_500000@example.com';
You’ll likely see type: ALL and a rows value close to your total table size. This indicates a full table scan. On a large table, this can take seconds or even minutes.
The solution is to create an index. An index is a separate data structure that the database uses to speed up data retrieval. Think of it like the index at the back of a book, which lists keywords and the pages they appear on.
CREATE INDEX idx_users_email ON users (email);
After creating the index, run the EXPLAIN command again:
EXPLAIN SELECT * FROM users WHERE email = 'user_500000@example.com';
Now, you should see type: ref (or eq_ref if it’s a unique index and the primary key is also involved) and a rows value much smaller than your total table size, often just 1. The database can now use the index to directly locate the row(s) matching the email, making the query incredibly fast. The index essentially organizes the email values in a sorted structure (like a B-tree), allowing for logarithmic time lookups instead of linear scans.
The EXPLAIN output is your primary tool for understanding how the database executes a query. Look for type: ALL (bad), type: index (better, but still scanning the whole index), type: range, type: ref, type: eq_ref, and type: const (increasingly good). The rows column gives an estimate of how many rows the database thinks it will need to examine. A high rows number for a query that should be fast is a red flag.
Beyond simple indexes, consider composite indexes for queries that filter on multiple columns. If you frequently run queries like SELECT * FROM orders WHERE user_id = 123 AND order_date BETWEEN '2023-01-01' AND '2023-12-31', a composite index on (user_id, order_date) will be much more effective than two separate indexes. The order of columns in a composite index matters; put the column with the highest cardinality (most distinct values) or the one most frequently used in equality checks first.
Sometimes, even with indexes, queries can be slow if the database chooses the wrong index or if the data distribution is skewed. This is where query optimization comes in. You can often influence the query planner using hints (though use these sparingly) or by rewriting the query. For example, functions applied to indexed columns in the WHERE clause (like WHERE YEAR(signup_date) = 2023) prevent index usage. Instead, rewrite it as WHERE signup_date BETWEEN '2023-01-01' AND '2023-12-31'.
The database’s query optimizer tries to find the most efficient way to execute your SQL. It considers available indexes, table statistics, and various execution strategies. When a query is slow, it’s often because the optimizer doesn’t have enough information (outdated statistics) or it’s making a suboptimal choice based on the available information. Running ANALYZE TABLE users; can update table statistics, helping the optimizer make better decisions.
One common misconception is that more indexes are always better. While indexes speed up reads, they incur overhead on writes (INSERT, UPDATE, DELETE) because the index itself must also be updated. For tables with very high write volumes, you need to carefully balance the number of indexes against read performance needs. Too many indexes can actually slow down your overall application.
After optimizing your indexes and queries, the next challenge is often connection pooling and managing the number of open database connections efficiently.