Postgres’s full-text search isn’t just about finding keywords; it’s a sophisticated system that analyzes and ranks documents based on linguistic relevance.
Let’s see it in action. Imagine a simple documents table:
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT,
body TEXT
);
INSERT INTO documents (title, body) VALUES
('PostgreSQL Basics', 'This document covers the fundamental concepts of PostgreSQL.'),
('Advanced SQL Techniques', 'Explore advanced SQL features like window functions and CTEs.'),
('Full-Text Search in Postgres', 'Learn how to implement and optimize full-text search in PostgreSQL.'),
('Database Performance Tuning', 'Tips and tricks for improving your database performance.');
Now, let’s add a tsvector column and an index for efficient searching:
ALTER TABLE documents ADD COLUMN tsv tsvector;
UPDATE documents SET tsv = to_tsvector('english', title || ' ' || body);
CREATE INDEX ts_idx ON documents USING GIN(tsv);
The to_tsvector('english', ...) function takes our text and converts it into a special tsvector type. It stems words (e.g., "searching" becomes "search"), removes stop words (like "the", "a", "is"), and stores the remaining lexemes along with their positions. The GIN index on tsv makes querying this data lightning fast.
A typical query might look like this:
SELECT title
FROM documents
WHERE tsv @@ to_tsquery('english', 'postgres & search');
Here, @@ is the "matches" operator. to_tsquery('english', 'postgres & search') converts our search terms into a tsquery object. The & operator means both "postgres" and "search" must be present. This query will return "Full-Text Search in Postgres".
But what if we want to rank results by relevance? Postgres has us covered with ts_rank and ts_rank_cd.
SELECT title, ts_rank(tsv, to_tsquery('english', 'postgres & search')) AS rank
FROM documents
WHERE tsv @@ to_tsquery('english', 'postgres & search')
ORDER BY rank DESC;
ts_rank calculates a score based on how often the search terms appear in the document and how close they are to each other. ts_rank_cd (coverage density) is often preferred as it penalizes documents that have many occurrences of the search terms spread far apart, potentially indicating a less focused document.
The mental model for how this works internally is that each document is transformed into a bag of lexemes (normalized words) with positional information. The index, typically a GIN (Generalized Inverted Index), stores these lexemes and maps them back to the documents they appear in. When you query, Postgres looks up the lexemes from your tsquery in the GIN index, finds all documents containing those lexemes, and then uses the positional data and ranking functions to score and order the results. The tsvector itself is effectively a pre-processed, indexed representation of your text data, optimized for fast and relevant searching.
The real magic happens with the configuration of to_tsvector and to_tsquery. You can specify different text search configurations beyond just 'english'. These configurations define the parser (how text is broken into tokens), the dictionaries (for stemming, thesaurus, stop words), and the operations that transform tokens into lexemes. For example, you might use a configuration that handles hyphens differently or includes domain-specific stop words.
You can also use phrase searching and proximity matching within your queries. For instance, to_tsquery('english', 'postgres <-> search') will find documents where "postgres" and "search" appear adjacent to each other, which is often a stronger indicator of relevance than just having both words present.
The most surprising thing is how much control you have over the definition of "relevance" through these configurations and ranking functions. It’s not just about keyword matching; it’s about linguistic similarity, document structure, and user-defined importance. For instance, you can assign weights to different parts of your tsvector, giving more importance to matches in the title column than the body column by creating a weighted tsvector.
ALTER TABLE documents ADD COLUMN tsv_weighted tsvector;
UPDATE documents SET tsv_weighted =
setweight(to_tsvector('english', title), 'A') || ' ' ||
setweight(to_tsvector('english', body), 'D');
CREATE INDEX ts_weighted_idx ON documents USING GIN(tsv_weighted);
SELECT title, ts_rank_cd(tsv_weighted, to_tsquery('english', 'postgres & search')) AS rank
FROM documents
WHERE tsv_weighted @@ to_tsquery('english', 'postgres & search')
ORDER BY rank DESC;
Here, 'A' is the highest weight and 'D' is the lowest. This means matches in the title will contribute significantly more to the ranking score than matches in the body.
Understanding how to build and tune these text search configurations is the next frontier for truly powerful full-text search in Postgres.