PostgreSQL extensions can do far more than just add new data types; they can fundamentally alter how the database core operates.
Let’s see how pg_stat_statements works by looking at some actual data.
First, we need to enable the extension and configure PostgreSQL to load it. This involves two steps:
-
Install the extension package: On most Linux systems, this is done via the distribution’s package manager. For example, on Debian/Ubuntu:
sudo apt-get install postgresql-contribOr on RHEL/CentOS:
sudo yum install postgresql-contribThis places the extension’s shared library and control files in the appropriate PostgreSQL directories.
-
Enable the extension in
postgresql.confand load it: Edit yourpostgresql.conffile (usually located in/etc/postgresql/<version>/main/or/var/lib/pgsql/<version>/data/). You’ll need to add two lines:shared_preload_libraries = 'pg_stat_statements' pg_stat_statements.track = allshared_preload_libraries: This tells PostgreSQL to load thepg_stat_statementslibrary into every backend process when it starts. This is crucial becausepg_stat_statementsneeds to hook into query execution before they run to collect data.pg_stat_statements.track: This setting controls what types of statements are tracked.alltracks everything,toptracks only top-level statements (not subqueries), andnonedisables tracking.allis the most comprehensive for performance analysis.
After modifying
postgresql.conf, you must restart your PostgreSQL server for these changes to take effect.sudo systemctl restart postgresql -
Create the extension in your database: Once the server is restarted, connect to the specific database where you want to use the extension (e.g., your application’s database) and run:
CREATE EXTENSION pg_stat_statements;This command creates the necessary tables and functions for
pg_stat_statementswithin that database.
Now, pg_stat_statements is actively collecting data. You can query its view to see the statistics:
SELECT
calls,
total_exec_time,
rows,
shared_blks_hit,
shared_blks_read,
query
FROM
pg_stat_statements
ORDER BY
total_exec_time DESC
LIMIT 10;
Here’s what you might see:
| calls | total_exec_time | rows | shared_blks_hit | shared_blks_read | query |
|---|---|---|---|---|---|
| 15000 | 123456.789 | 30000 | 1500000 | 50000 | SELECT * FROM users WHERE registration_date < $1 |
| 500 | 98765.432 | 500 | 200000 | 10000 | UPDATE products SET price = price * $1 WHERE category_id = $2 |
| 10000 | 50000.123 | 10000 | 500000 | 20000 | SELECT COUNT(*) FROM orders WHERE order_date BETWEEN $1 AND $2 |
| 2000 | 25000.567 | 0 | 100000 | 5000 | INSERT INTO logs (message, timestamp) VALUES ($1, NOW()) |
| 7500 | 15000.890 | 15000 | 300000 | 8000 | SELECT id, name FROM customers WHERE status = $1 ORDER BY last_order_date DESC |
calls: The number of times the query has been executed.total_exec_time: The total time spent executing the query, in milliseconds. This is a primary indicator of "expensive" queries.rows: The total number of rows returned or affected by the query.shared_blks_hit: The number of times a block was found in the shared buffer cache.shared_blks_read: The number of times a block had to be read from disk. Highshared_blks_readrelative toshared_blks_hitoften indicates I/O bottlenecks.query: The normalized query string. Notice parameters like$1,$2.pg_stat_statementsgroups identical queries with different parameter values.
The Problem It Solves:
Without pg_stat_statements, identifying which specific SQL queries are consuming the most resources (CPU, I/O, time) is a manual, often painful process. You might look at pg_stat_activity, but that only shows currently running queries. pg_stat_statements provides historical aggregate data, allowing you to pinpoint slow or frequently executed queries over longer periods.
How It Works Internally:
When you set shared_preload_libraries = 'pg_stat_statements', PostgreSQL loads the pg_stat_statements.so shared object into memory when the server starts. This library then registers hooks into various stages of the query processing pipeline. Specifically, it intercepts the query planning and execution phases.
When a query is executed, pg_stat_statements looks up the normalized version of the query (e.g., SELECT * FROM users WHERE id = $1) in an internal hash table.
- If the normalized query is already in the table, it increments the
callscount for that entry and adds the current query’s execution time, rows processed, and block I/O statistics to the existing totals. - If the normalized query is not in the table, it creates a new entry and initializes its statistics.
The "normalization" process is key: it replaces literal values with parameter placeholders ($1, $2, etc.). This is why SELECT * FROM users WHERE id = 1 and SELECT * FROM users WHERE id = 123 are treated as the same query for statistical purposes, preventing the statistics table from exploding with near-identical queries.
The pg_stat_statements.track setting determines when this interception happens. all means it tracks both top-level queries and any subqueries, CTEs, or statements within functions. top only tracks queries initiated by a client.
The Levers You Control:
pg_stat_statements.track: As discussed,allvs.topvs.none.allis the most useful for deep analysis but can have a slightly higher overhead.pg_stat_statements.max: This parameter limits the number of distinct normalized statements that can be stored. The default is 5000. If you exceed this, the least used statements are discarded to make room. For busy systems with many unique queries, you might need to increase this, e.g.,pg_stat_statements.max = 20000.pg_stat_statements.save: If set toon(the default isoff), the statistics are preserved across server restarts. This is invaluable for long-term trend analysis, but it means the statistics table grows indefinitely. Ifoff, statistics are reset on every restart.pg_stat_statements.track_utility: Ifon(default isoff), it also tracks utility commands likeCREATE TABLE,ALTER TABLE,VACUUM, etc. This can be useful for understanding DDL performance.pg_stat_statements.track_planning: Ifon(default isoff), it also tracks the time spent planning queries, not just executing them. This is crucial for identifying planning-intensive queries.
The pg_stat_statements.utility setting, when enabled, also captures metadata about the execution plan itself, including the plan node types used and the estimated cost, which can be invaluable for deep query optimization.
The next step after analyzing query performance is often to look at database indexing strategies, which pg_stat_statements can indirectly inform by highlighting queries that perform sequential scans on large tables.