Database storage and query costs are often the biggest line items in cloud infrastructure bills, but they’re also the most amenable to optimization.
Let’s look at an example of a PostgreSQL database and how we can tune it.
-- Sample table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INT,
product_id INT,
order_date TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
quantity INT,
price NUMERIC(10, 2),
status VARCHAR(50)
);
-- Sample data insertion
INSERT INTO orders (customer_id, product_id, quantity, price, status) VALUES
(101, 201, 2, 19.99, 'Shipped'),
(102, 205, 1, 45.50, 'Processing'),
(101, 203, 3, 10.00, 'Delivered');
-- A common query that might be slow and expensive
SELECT
o.order_id,
o.order_date,
c.customer_name,
p.product_name,
o.quantity,
o.price
FROM
orders o
JOIN
customers c ON o.customer_id = c.customer_id
JOIN
products p ON o.product_id = p.product_id
WHERE
o.order_date >= '2023-01-01' AND o.status = 'Shipped';
The most surprising truth about database cost optimization is that it’s rarely about "faster queries" in the sense of reducing latency for a single request. It’s almost always about reducing the amount of work the database has to do, which directly translates to less CPU, I/O, and memory usage, and therefore, lower costs. This means optimizing storage and query execution patterns together.
Storage Optimization
Databases consume storage for data, indexes, and transaction logs. Each has different optimization strategies.
1. Data Compression:
Most databases offer data compression at various levels. For PostgreSQL, this often involves extensions like pg_partman for partitioning (which can then be combined with compression on older partitions) or using features like TOAST (The Oversized-Attribute Storage Technique) which automatically compresses large fields.
- Diagnosis: Check table and index sizes.
SELECT pg_size_pretty(pg_total_relation_size('orders')); - Fix: For TOASTable columns (like
TEXT,BYTEA,JSONB,VARCHAR), compression is often automatic. For other types or to enforce compression on larger tables, consider partitioning and then using compression on older, less accessed partitions. For example, withpg_partman, you can configure automatic detachment and compression of older partitions.-- Example of enabling compression on a table (hypothetical, depends on extension) ALTER TABLE orders SET (compression = 'lz4'); -- This syntax is illustrative, actual command depends on the compression method/extension - Why it works: Compression reduces the number of bytes that need to be read from disk and stored, directly lowering storage costs and improving I/O performance for queries that need to scan large amounts of data.
2. Table Partitioning:
For large tables, especially those with time-series data like orders, partitioning can significantly reduce query costs and storage management overhead. Partitioning splits a large table into smaller, more manageable pieces based on a key (e.g., date).
- Diagnosis: Identify large tables and their access patterns. If queries frequently filter by date, partitioning by date is a strong candidate.
SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size FROM pg_class WHERE relkind = 'r' -- 'r' for relation (table) ORDER BY pg_total_relation_size(oid) DESC LIMIT 10; - Fix: Implement declarative partitioning in PostgreSQL (version 10+). For example, partition
ordersbyorder_dateinto monthly partitions.-- Create the parent table CREATE TABLE orders ( order_id SERIAL, customer_id INT, product_id INT, order_date TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP, quantity INT, price NUMERIC(10, 2), status VARCHAR(50) ) PARTITION BY RANGE (order_date); -- Create partitions (example for one month) CREATE TABLE orders_2023_01 PARTITION OF orders FOR VALUES FROM ('2023-01-01 00:00:00+00') TO ('2023-02-01 00:00:00+00'); -- Use pg_partman for automated management -- Install and configure pg_partman extension -- Then, use its functions to create and manage partitions automatically. - Why it works: Queries that filter by the partition key (e.g.,
WHERE order_date >= '2023-01-01') can prune partitions, meaning the database only scans relevant partitions, drastically reducing I/O and CPU. Older, unneeded partitions can be detached, archived, or dropped without affecting the active data.
3. Index Management: Unused or redundant indexes consume storage and slow down writes. Conversely, missing indexes on frequently queried columns can lead to full table scans.
- Diagnosis: Identify unused indexes.
SELECT schemaname, relname, indexrelname, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size, idx_scan AS reads, idx_tup_read AS tuples_read FROM pg_stat_user_indexes WHERE idx_scan = 0 AND indexrelname NOT LIKE '%_pkey' -- Exclude primary key indexes ORDER BY pg_relation_size(indexrelid) DESC; - Fix: Drop unused indexes.
DROP INDEX IF EXISTS unused_index_name; - Why it works: Removing indexes reduces storage footprint and speeds up
INSERT,UPDATE, andDELETEoperations because fewer index structures need to be maintained.
Query Cost Optimization
Optimizing queries is about reducing the number of blocks read from disk and the amount of CPU/memory used.
1. Indexing Strategy:
Ensure appropriate indexes exist for WHERE clauses, JOIN conditions, and ORDER BY clauses.
- Diagnosis: Use
EXPLAIN ANALYZEon your slow queries. Look for "Seq Scan" (Sequential Scan) on large tables where a targeted index could be used, and "Bitmap Heap Scan" with a large number of rows fetched.EXPLAIN ANALYZE SELECT o.order_id, o.order_date, c.customer_name, p.product_name, o.quantity, o.price FROM orders o JOIN customers c ON o.customer_id = c.customer_id JOIN products p ON o.product_id = p.product_id WHERE o.order_date >= '2023-01-01' AND o.status = 'Shipped'; - Fix: Create composite indexes that cover the
WHEREandJOINconditions. The order of columns in the index matters.CREATE INDEX idx_orders_date_status ON orders (order_date, status); CREATE INDEX idx_orders_customer_id ON orders (customer_id); CREATE INDEX idx_orders_product_id ON orders (product_id); - Why it works: Indexes allow the database to quickly locate specific rows without scanning the entire table, dramatically reducing I/O and CPU for filtered queries. A composite index on
(order_date, status)can efficiently serve queries filtering on both columns.
2. Query Rewriting & Avoiding SELECT *:
Fetch only the columns you need. Avoid SELECT *.
- Diagnosis: Review application code for
SELECT *. Check query plans for unnecessary columns being fetched. - Fix: Explicitly list the required columns.
-- Instead of: SELECT * FROM orders WHERE order_id = 123; SELECT order_id, customer_id, order_date, price FROM orders WHERE order_id = 123; - Why it works: Reduces the amount of data read from disk (if indexes don’t cover all requested columns) and transferred over the network, saving I/O and bandwidth costs.
3. Understanding VACUUM and ANALYZE:
PostgreSQL uses MVCC (Multi-Version Concurrency Control). VACUUM reclaims space from dead rows, and ANALYZE updates statistics for the query planner. Autovacuum usually handles this, but misconfiguration or very high transaction rates can lead to issues.
- Diagnosis: Check for high
xidwraparound risks or stale statistics.SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY age(datfrozenxid) DESC LIMIT 5; -- Check autovacuum settings SHOW autovacuum; SHOW autovacuum_vacuum_threshold; SHOW autovacuum_analyze_threshold; - Fix: Ensure autovacuum is enabled and properly tuned. Manually run
VACUUM ANALYZEon heavily modified tables if needed.VACUUM ANALYZE orders; - Why it works: Proper vacuuming prevents table bloat and keeps statistics up-to-date, ensuring the query planner makes optimal choices, which indirectly reduces query costs.
4. Connection Pooling: While not directly a query/storage cost, establishing database connections is resource-intensive. Connection pooling reduces this overhead.
- Diagnosis: High number of idle connections in the database, or frequent connection/disconnection logs.
SELECT count(*), state FROM pg_stat_activity GROUP BY state; - Fix: Implement a connection pooler like PgBouncer or use built-in pooling in application frameworks. Configure it with appropriate pool sizes (e.g.,
max_client_conn = 100in PgBouncer). - Why it works: Reuses existing connections, reducing CPU and memory overhead on the database server for connection management, thereby lowering overall operational costs.
The next problem you’ll likely encounter after optimizing storage and query costs is managing the lifecycle of your data and ensuring that older, less frequently accessed data is moved to cheaper storage tiers or archived.