Full-Text Search: Beyond Basic String Matching

Postgres’ full-text search (FTS) is more than just keyword matching; it’s a sophisticated system for understanding and querying human language.

Let’s see it in action. Imagine a simple table of articles:

CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT
);

INSERT INTO articles (title, body) VALUES
('The Importance of PostgreSQL', 'PostgreSQL is a powerful, open-source relational database system known for its reliability, feature robustness, and performance.'),
('Advanced SQL Techniques', 'This article explores advanced SQL techniques, including window functions and common table expressions (CTEs) in PostgreSQL.'),
('Database Performance Tuning', 'Tuning database performance is crucial for applications. We will discuss indexing strategies and query optimization for PostgreSQL.');

Now, let’s set up FTS. We need two main components: tsvector and tsquery.

A tsvector is a data type that stores a document in normalized, search-ready form. It’s essentially a sorted list of distinct lexemes (words reduced to a common stem), each with the positions where it occurred in the text.

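ALTER TABLE articles ADD COLUMN tsv tsvector;
-- coalesce() keeps a NULL title or body from turning the whole document into NULL
UPDATE articles SET tsv = to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''));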

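Here, to_tsvector('english', ...) takes our text, converts it to lowercase, removes stop words (like "the", "a", "is"), and stems words (e.g., "running" becomes "run"). The tsv column now holds this processed data. Note that the UPDATE is a one-off: rows inserted or changed later need the column refreshed, for example by a trigger or a generated column.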
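
A quick way to see this normalization is to call to_tsvector directly on a literal. The sentence below is only an illustration (it is the one used in the PostgreSQL documentation), and the output shown is what the stock 'english' configuration produces.

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
-- 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

The stop words "a", "on", and "it" are dropped but still count toward the positions, and "rats" has been stemmed to "rat".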

To make searches on this column efficient, we add a GIN (Generalized Inverted Index):

CREATE INDEX idx_articles_tsv ON articles USING GIN(tsv);

A tsquery is the matching data type for queries. to_tsquery normalizes the search terms with the same configuration, so they line up with the lexemes stored in the tsvector. Terms can be combined with boolean operators (& for AND, | for OR, ! for NOT) and the phrase operator (<-> for FOLLOWED BY).
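
To see how query text is normalized, it can help to call the query constructors on literals. This is a small sketch using the built-in functions; the inputs are illustrative and taken from the PostgreSQL documentation, and websearch_to_tsquery assumes PostgreSQL 11 or later.

SELECT to_tsquery('english', 'The & Fat & Rats');       -- 'fat' & 'rat'
SELECT plainto_tsquery('english', 'The Fat Rats');      -- 'fat' & 'rat'
SELECT phraseto_tsquery('english', 'The Fat Rats');     -- 'fat' <-> 'rat'
SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
-- 'supernova' <-> 'star' & !'crab'

to_tsquery expects operator syntax and raises an error on malformed input, so plainto_tsquery or websearch_to_tsquery is usually the safer choice for text typed by end users.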

Now, we can search. To find articles containing "PostgreSQL" and "performance":

SELECT id, title FROM articles
WHERE tsv @@ to_tsquery('english', 'PostgreSQL & performance');

This query uses the @@ operator, which checks whether the tsvector matches the tsquery. It returns articles that contain both "PostgreSQL" and "performance" (or words that reduce to the same lexemes) in their title or body. With the sample data, that is the first and third articles.

The real power comes from understanding how the processing works. The text search configuration (here, 'english') is key. It selects a parser, which splits text into tokens, and maps each token type to a chain of dictionaries, which handle stop words and stemming. Postgres comes with several built-in configurations, but you can create custom ones.

For example, if you want to search for "databases" and "DBs" as the same concept, you might need a custom dictionary or synonym rules, which are configured within a text search configuration. You can inspect a configuration using pg_ts_config_map:

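SELECT maptokentype, mapseqno, mapdict::regdictionary AS dictionary
FROM pg_ts_config_map WHERE mapcfg = 'english'::regconfig;

This shows which dictionary handles each token type in the 'english' configuration. The parser is recorded in pg_ts_config, and stop-word lists belong to the individual dictionaries (see pg_ts_dict). In psql, \dF+ english gives a more readable summary.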
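
One way to approach the "databases"/"DBs" case is a synonym dictionary placed ahead of the stemmer. The following is a sketch under assumptions: the names db_syn and english_syn and the file db_synonyms.syn are hypothetical, and the file has to exist in the server’s $SHAREDIR/tsearch_data directory with one "word replacement" pair per line (for instance "dbs databas").

-- Hypothetical synonym dictionary; reads $SHAREDIR/tsearch_data/db_synonyms.syn
CREATE TEXT SEARCH DICTIONARY db_syn (
    TEMPLATE = synonym,
    SYNONYMS = db_synonyms
);

-- Copy the built-in configuration so 'english' itself stays untouched
CREATE TEXT SEARCH CONFIGURATION english_syn (COPY = english);

-- Consult the synonym dictionary first, then fall back to the stemmer
ALTER TEXT SEARCH CONFIGURATION english_syn
    ALTER MAPPING FOR asciiword WITH db_syn, english_stem;

A token recognized by the synonym dictionary is not passed on to the stemmer, so the replacement in each line should be the lexeme english_stem would produce for the target word; SELECT ts_lexize('english_stem', 'databases'); shows what that is. Both to_tsvector and to_tsquery then need to be called with 'english_syn'.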

The position information stored in tsvector is what enables phrase searching. If you search for 'tuning <-> performance', it means "tuning" must immediately precede "performance".

SELECT id, title FROM articles
WHERE tsv @@ to_tsquery('english', 'tuning <-> performance');

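This would match "tuning performance" but not "performance tuning". Against the sample rows it returns nothing: the third article has "Performance Tuning" in its title and "Tuning database performance" in its body, and neither places "tuning" directly before "performance".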
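
The effect of word order and distance is easiest to check on literals. The snippet below is a minimal sketch with made-up input strings; <2> is the general form of the phrase operator and requires the second lexeme exactly two positions after the first.

-- Adjacent and in order: matches
SELECT to_tsvector('english', 'tuning performance')
       @@ to_tsquery('english', 'tuning <-> performance');   -- true

-- Same words, reversed order: no match
SELECT to_tsvector('english', 'performance tuning')
       @@ to_tsquery('english', 'tuning <-> performance');   -- false

-- One word in between: matched by a distance of 2
SELECT to_tsvector('english', 'tuning database performance')
       @@ to_tsquery('english', 'tuning <2> performance');   -- true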

The tsquery syntax is quite expressive. You can combine terms: 'PostgreSQL & (performance | tuning)' will find articles mentioning "PostgreSQL" AND either "performance" OR "tuning".
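
Run against the sample table, combined queries look like this. The expected rows in the comments assume the three articles inserted earlier and the tsv column built from title and body.

-- "PostgreSQL" AND ("performance" OR "tuning")
SELECT id, title FROM articles
WHERE tsv @@ to_tsquery('english', 'PostgreSQL & (performance | tuning)');
-- sample data: "The Importance of PostgreSQL" and "Database Performance Tuning"

-- "PostgreSQL" but NOT "performance"
SELECT id, title FROM articles
WHERE tsv @@ to_tsquery('english', 'PostgreSQL & !performance');
-- sample data: "Advanced SQL Techniques"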

A common misconception is that FTS is just a simple LIKE operator with stemming. It’s much more. The GIN index on tsvector allows for very fast lookups of lexemes, and the tsquery processing ensures that queries are matched against the normalized document representation, not just raw text.
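
To check that the GIN index is actually used, EXPLAIN helps. This is only a sketch: on a three-row table the planner will normally prefer a sequential scan, so the example disables sequential scans for the session purely for demonstration.

SET enable_seqscan = off;   -- demonstration only, not for production use
EXPLAIN SELECT id, title FROM articles
WHERE tsv @@ to_tsquery('english', 'postgresql');
RESET enable_seqscan;

With the index in place, the plan should show a Bitmap Index Scan on idx_articles_tsv rather than a Seq Scan.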

The tsvector itself can be generated on the fly during queries, but it’s usually far more efficient to store it in a column and index it. This is especially true for large datasets or frequent searches.
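
Two common ways to keep the stored form in sync without manual UPDATEs are sketched below. The column and index names (tsv_gen, idx_articles_tsv_gen, idx_articles_fts_expr) are hypothetical, and the generated column assumes PostgreSQL 12 or later.

-- Option 1: a stored generated column, recomputed on every INSERT/UPDATE
ALTER TABLE articles
    ADD COLUMN tsv_gen tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
    ) STORED;
CREATE INDEX idx_articles_tsv_gen ON articles USING GIN (tsv_gen);

-- Option 2: no extra column, index the expression itself
CREATE INDEX idx_articles_fts_expr ON articles
    USING GIN (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')));

With the expression index, a query only uses the index if its WHERE clause repeats the same expression, including the explicit 'english' argument.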

One thing that trips people up is how different languages are handled. The 'english' configuration is specific. If your documents are in French, you’d use 'french', which has its own set of stop words and stemming rules. Using the wrong configuration leads to poor search results, missing relevant documents or returning irrelevant ones. For example, French articles such as "les" and "des" are not on the English stop-word list, so the English configuration indexes them as ordinary terms. The English stemmer also does not know French suffixes, so inflected forms of a French word may not reduce to a common lexeme, and a search for one form can miss the others.
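
The difference is easy to inspect by running the same text through both configurations. The French sentence below is made up for illustration, and no output is shown because the exact lexemes depend on the installed dictionaries.

SELECT to_tsvector('english', 'Les voitures sont rapides') AS with_english,
       to_tsvector('french',  'Les voitures sont rapides') AS with_french;

The 'french' result would be expected to drop "les" and "sont" as stop words, while the 'english' result keeps them as lexemes.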

The next step is often dealing with different languages within the same dataset or optimizing for very specific search requirements using custom text search configurations.