PostgreSQL Power-Ups: CTEs, Window Functions, Indexing

PostgreSQL extensions can do far more than just add new data types; they can fundamentally alter how the database core operates.

Let’s see how pg_stat_statements works by looking at some actual data.

First, we need to enable the extension and configure PostgreSQL to load it. This involves two steps:

Install the extension package: On most Linux systems, this is done via the distribution’s package manager. For example, on Debian/Ubuntu:
```
sudo apt-get install postgresql-contrib
```
Or on RHEL/CentOS:
```
sudo yum install postgresql-contrib
```
This places the extension’s shared library and control files in the appropriate PostgreSQL directories.
Enable the extension in postgresql.conf and load it: Edit your postgresql.conf file (usually located in /etc/postgresql/<version>/main/ or /var/lib/pgsql/<version>/data/). You’ll need to add two lines:
```
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
```
- shared_preload_libraries: This tells PostgreSQL to load the pg_stat_statements library into every backend process when it starts. This is crucial because pg_stat_statements needs to hook into query execution before they run to collect data.
- pg_stat_statements.track: This setting controls what types of statements are tracked. all tracks everything, top tracks only top-level statements (not subqueries), and none disables tracking. all is the most comprehensive for performance analysis.
After modifying postgresql.conf, you must restart your PostgreSQL server for these changes to take effect.
```
sudo systemctl restart postgresql
```
Create the extension in your database: Once the server is restarted, connect to the specific database where you want to use the extension (e.g., your application’s database) and run:
```
CREATE EXTENSION pg_stat_statements;
```
This command creates the necessary tables and functions for pg_stat_statements within that database.

Now, pg_stat_statements is actively collecting data. You can query its view to see the statistics:

SELECT
    calls,
    total_exec_time,
    rows,
    shared_blks_hit,
    shared_blks_read,
    query
FROM
    pg_stat_statements
ORDER BY
    total_exec_time DESC
LIMIT 10;

Here’s what you might see:

calls	total_exec_time	rows	shared_blks_hit	shared_blks_read	query
15000	123456.789	30000	1500000	50000	SELECT * FROM users WHERE registration_date < $1
500	98765.432	500	200000	10000	UPDATE products SET price = price * $1 WHERE category_id = $2
10000	50000.123	10000	500000	20000	SELECT COUNT(*) FROM orders WHERE order_date BETWEEN $1 AND $2
2000	25000.567	0	100000	5000	INSERT INTO logs (message, timestamp) VALUES ($1, NOW())
7500	15000.890	15000	300000	8000	SELECT id, name FROM customers WHERE status = $1 ORDER BY last_order_date DESC

calls: The number of times the query has been executed.
total_exec_time: The total time spent executing the query, in milliseconds. This is a primary indicator of "expensive" queries.
rows: The total number of rows returned or affected by the query.
shared_blks_hit: The number of times a block was found in the shared buffer cache.
shared_blks_read: The number of times a block had to be read from disk. High shared_blks_read relative to shared_blks_hit often indicates I/O bottlenecks.
query: The normalized query string. Notice parameters like $1, $2. pg_stat_statements groups identical queries with different parameter values.

The Problem It Solves: Without pg_stat_statements, identifying which specific SQL queries are consuming the most resources (CPU, I/O, time) is a manual, often painful process. You might look at pg_stat_activity, but that only shows currently running queries. pg_stat_statements provides historical aggregate data, allowing you to pinpoint slow or frequently executed queries over longer periods.

How It Works Internally: When you set shared_preload_libraries = 'pg_stat_statements', PostgreSQL loads the pg_stat_statements.so shared object into memory when the server starts. This library then registers hooks into various stages of the query processing pipeline. Specifically, it intercepts the query planning and execution phases.

When a query is executed, pg_stat_statements looks up the normalized version of the query (e.g., SELECT * FROM users WHERE id = $1) in an internal hash table.

If the normalized query is already in the table, it increments the calls count for that entry and adds the current query’s execution time, rows processed, and block I/O statistics to the existing totals.
If the normalized query is not in the table, it creates a new entry and initializes its statistics.

The "normalization" process is key: it replaces literal values with parameter placeholders ($1, $2, etc.). This is why SELECT * FROM users WHERE id = 1 and SELECT * FROM users WHERE id = 123 are treated as the same query for statistical purposes, preventing the statistics table from exploding with near-identical queries.

The pg_stat_statements.track setting determines when this interception happens. all means it tracks both top-level queries and any subqueries, CTEs, or statements within functions. top only tracks queries initiated by a client.

The Levers You Control:

pg_stat_statements.track: As discussed, all vs. top vs. none. all is the most useful for deep analysis but can have a slightly higher overhead.
pg_stat_statements.max: This parameter limits the number of distinct normalized statements that can be stored. The default is 5000. If you exceed this, the least used statements are discarded to make room. For busy systems with many unique queries, you might need to increase this, e.g., pg_stat_statements.max = 20000.
pg_stat_statements.save: If set to on (the default is off), the statistics are preserved across server restarts. This is invaluable for long-term trend analysis, but it means the statistics table grows indefinitely. If off, statistics are reset on every restart.
pg_stat_statements.track_utility: If on (default is off), it also tracks utility commands like CREATE TABLE, ALTER TABLE, VACUUM, etc. This can be useful for understanding DDL performance.
pg_stat_statements.track_planning: If on (default is off), it also tracks the time spent planning queries, not just executing them. This is crucial for identifying planning-intensive queries.

The pg_stat_statements.utility setting, when enabled, also captures metadata about the execution plan itself, including the plan node types used and the estimated cost, which can be invaluable for deep query optimization.

The next step after analyzing query performance is often to look at database indexing strategies, which pg_stat_statements can indirectly inform by highlighting queries that perform sequential scans on large tables.