PostgreSQL provides the necessary building blocks for you to combine and create your own search engine for full-text search. Let's see how far we can take it.
Written by
Tudor Golubenco
Published on
July 12, 2023
This is part 1 of a blog mini-series, in which we explore the full-text search functionality in PostgreSQL and investigate how much of the typical search engine functionality we can replicate. In part 2, we'll do a comparison between PostgreSQL's full-text search and Elasticsearch.
Hey there, you are on the Xata engineering blog. Xata is a serverless data platform that makes building on top of PostgreSQL and Elasticsearch really easy. Sign up today!
If you want to follow along and try out the sample queries (which we recommend; it's more fun that way), the code samples are executed against the Wikipedia Movie Plots data set from Kaggle. To import it, download the CSV file, then create this table:
CREATE TABLE movies(
ReleaseYear int,
Title text,
Origin text,
Director text,
Casting text,
Genre text,
WikiPage text,
Plot text);
And import the CSV file like this:
\COPY movies(ReleaseYear, Title, Origin, Director, Casting, Genre, WikiPage, Plot)
FROM 'wiki_movie_plots_deduped.csv' DELIMITER ',' CSV HEADER;
The dataset contains 34,000 movie titles and is about 81 MB in CSV format.
The Postgres approach to full-text search offers building blocks that you can combine to create your own search engine. This is quite flexible but it also means it generally feels lower-level compared to search engines like Elasticsearch, Typesense, or Meilisearch, for which full-text search is the primary use case.
The main building blocks, which we'll cover via examples, are:
tsvector
and tsquery
data types@@
to check if a tsquery
matches a tsvector
ts_rank
, ts_rank_cd
)tsvector
We'll start by looking at these building blocks and then we'll get into more advanced topics, covering relevancy boosters, typo-tolerance, and faceted search.
The tsvector
data type stores a sorted list of lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as s
or ing
in English). Here is an example, using the to_tsvector
function to parse an English phrase into a tsvector
.
SELECT * FROM unnest(to_tsvector('english',
'I''m going to make him an offer he can''t refuse. Refusing is not an option.'));
lexeme | positions | weights
--------+-----------+---------
go | {3} | {D}
m | {2} | {D}
make | {5} | {D}
offer | {8} | {D}
option | {17} | {D}
refus | {12,13} | {D,D}
(6 rows)
As you can see, stop words like “I”, “to” or “an” are removed, because they are too common to be useful for search. The words are normalized and reduced to their root (e.g. “refuse” and “Refusing” are both transformed into “refus”). The punctuation signs are ignored. For each word, the positions in the original phrase are recorded (e.g. “refus” is the 12th and the 13th word in the text) and the weights (which are useful for ranking and we'll discuss later).
In the example above, the transformation rules from words to lexemes are based on the english
search configuration. Running the same query with the simple
search configuration results in a tsvector
that includes all the words as they were found in the text:
SELECT * FROM unnest(to_tsvector('simple',
'I''m going to make him an offer he can''t refuse. Refusing is not an option.'));
lexeme | positions | weights
----------+-----------+---------
an | {7,16} | {D,D}
can | {10} | {D}
going | {3} | {D}
he | {9} | {D}
him | {6} | {D}
i | {1} | {D}
is | {14} | {D}
m | {2} | {D}
make | {5} | {D}
not | {15} | {D}
offer | {8} | {D}
option | {17} | {D}
refuse | {12} | {D}
refusing | {13} | {D}
t | {11} | {D}
to | {4} | {D}
(16 rows)
As you can see, “refuse” and “refusing” now result in different lexemes. The simple
configuration is particularly useful when you have columns that contain labels or tags.
PostgreSQL has built-in configurations for a pretty good set of languages. You can see the list by running:
SELECT cfgname FROM pg_ts_config;
Notably, however, there is no configuration for CJK (Chinese-Japanese-Korean), which is worth keeping in mind if you need to create a search query in those languages. While the simple
configuration should work in practice quite well for unsupported languages, I'm not sure if that is enough for CJK.
The tsquery
data type is used to represent a normalized query. A tsquery
contains search terms, which must be already-normalized lexemes, and may combine multiple terms using AND, OR, NOT, and FOLLOWED BY operators. There are functions like to_tsquery
, plainto_tsquery
, and websearch_to_tsquery
that are helpful in converting user-written text into a proper tsquery
, primarily by normalizing words appearing in the text.
To get a feeling of tsquery
, let's see a few examples using websearch_to_tsquery
:
SELECT websearch_to_tsquery('english', 'the darth vader');
websearch_to_tsquery
----------------------
'darth' & 'vader'
That is a logical AND, meaning that the document needs to contain both "darth" and "vader" in order to match. You can do logical OR as well:
SELECT websearch_to_tsquery('english', 'darth OR vader');
websearch_to_tsquery
----------------------
'darth' | 'vader'
And you can exclude words:
SELECT websearch_to_tsquery('english', 'darth vader -wars');
websearch_to_tsquery
---------------------------
'darth' & 'vader' & !'war'
Also, you can represent phrase searches:
SELECT websearch_to_tsquery('english', '"the darth vader son"');
websearch_to_tsquery
------------------------------
'darth' <-> 'vader' <-> 'son'
This means: “darth”, followed by “vader”, followed by “son”.
Note, however, that the “the” word is ignored, because it's a stop word as per the english
search configuration. This can be an issue on phrases like this:
SELECT websearch_to_tsquery('english', '"do or do not, there is no try"');
websearch_to_tsquery
----------------------
'tri'
(1 row)
Oops, almost the entire phrase was lost. Using the simple
config gives the expected result:
SELECT websearch_to_tsquery('simple', '"do or do not, there is no try"');
websearch_to_tsquery
--------------------------------------------------------------------------
'do' <-> 'or' <-> 'do' <-> 'not' <-> 'there' <-> 'is' <-> 'no' <-> 'try'
You can check whether a tsquery
matches a tsvector
by using the match operator @@
.
SELECT websearch_to_tsquery('english', 'darth vader') @@
to_tsvector('english',
'Darth Vader is my father.');
?column?
----------
t
While the following example doesn't match:
SELECT websearch_to_tsquery('english', 'darth vader -father') @@
to_tsvector('english',
'Darth Vader is my father.');
?column?
----------
f
Now that we've seen tsvector
and tsquery
at work, let's look at another key building block: the GIN index type is what makes it fast. GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. This means that GIN can be used for more than just text search, notably for JSON querying.
You can create a GIN index on a set of columns, or you can first create a column of type tsvector
, to include all the searchable columns. Something like this:
ALTER TABLE movies ADD search tsvector GENERATED ALWAYS AS
(to_tsvector('english', Title) || ' ' ||
to_tsvector('english', Plot) || ' ' ||
to_tsvector('simple', Director) || ' ' ||
to_tsvector('simple', Genre) || ' ' ||
to_tsvector('simple', Origin) || ' ' ||
to_tsvector('simple', Casting)
) STORED;
And then create the actual index:
CREATE INDEX idx_search ON movies USING GIN(search);
You can now perform a simple test search like this:
SELECT title FROM movies WHERE search @@ websearch_to_tsquery('english','darth vader');
title
--------------------------------------------------
Star Wars Episode IV: A New Hope (aka Star Wars)
Return of the Jedi
Star Wars: Episode III – Revenge of the Sith
(3 rows)
To see the effects of the index, you can compare the timings of the above query with and without the index. The GIN index takes it from over 200 ms to about 4 ms on my computer.
So far, we've seen how ts_vector
and ts_query
can match search queries. However, for a good search experience, it is important to show the best results first - meaning that the results need to be sorted by relevancy.
Taking it directly from the docs:
PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.
The two ranking functions mentioned are ts_rank
and ts_rank_cd
. The difference between them is that while they both take into account the frequency of the term, ts_rank_cd
also takes into account the proximity of matching lexemes to each other.
To use them in a query, you can do something like this:
SELECT title,
ts_rank(search, websearch_to_tsquery('english', 'darth vader')) rank
FROM movies
WHERE search @@ websearch_to_tsquery('english','darth vader')
ORDER BY rank DESC
LIMIT 10;
title | rank
--------------------------------------------------+-------------
The Empire Strikes Back | 0.26263964
Star Wars Episode IV: A New Hope (aka Star Wars) | 0.18902963
Star Wars: Episode III – Revenge of the Sith | 0.10292397
Rogue One: A Star Wars Story (film) | 0.10049681
Return of the Jedi | 0.09910346
American Honey | 0.09910322
One thing to note about ts_rank
is that it needs to access the search
column for each result. This means that if the WHERE
condition matches a lot of rows, PostgreSQL needs to visit them all in order to do the ranking, and that can be slow. To exemplify, the above query returns in 5-7 ms on my computer. If I modify the query to do search for darth OR vader
, it returns in about 80 ms, because there are now over 1000 matching result that need ranking and sorting.
While relevancy based on word frequency is a good default for the search sorting, quite often the data contains important indicators that are more relevant than simply the frequency.
Here are some examples for a movies dataset:
This is why dedicated search engines typically offer ways to use different columns or fields to influence the ranking. Here are example tuning guides from Elastic, Typesense, and Meilisearch.
If you want a visual demo of the impact of relevancy tuning, here is a quick 4 minutes video about it:
While Postgres doesn't have direct support for boosting based on other columns, the rank is ultimately just a sort expression, so you can add your own signals to it.
For example, if you want to add a boost for the number of votes, you can do something like this:
SELECT title,
ts_rank(search, websearch_to_tsquery('english', 'jedi'))
-- numeric booster example
+ log(NumberOfVotes)*0.01
FROM movies
WHERE search @@ websearch_to_tsquery('english','jedi')
ORDER BY rank DESC LIMIT 10;
The logarithm is there to smoothen the impact, and the 0.01 factor brings the booster to a comparable scale with the ranking score.
You can also design more complex boosters, for example, boost by the rating, but only if the ranking has a certain number of votes. To do this, you can create a function like this:
create function numericBooster(rating numeric, votes numeric, voteThreshold numeric)
returns numeric as $$
select case when votes < voteThreshold then 0 else rating end;
$$ language sql;
And use it like this:
SELECT title,
ts_rank(search, websearch_to_tsquery('english', 'jedi'))
-- numeric booster example
+ numericBooster(Rating, NumberOfVotes, 100)*0.005
FROM movies
WHERE search @@ websearch_to_tsquery('english','jedi')
ORDER BY rank DESC LIMIT 10;
Let's take another example. Say we want to boost the ranking of comedies. You can create a valueBooster
function that looks like this:
create function valueBooster (col text, val text, factor integer)
returns integer as $$
select case when col = val then factor else 0 end;
$$ language sql;
The function returns a factor if the column matches a particular value and 0 instead. Use it in a query like this:
SELECT title, genre,
ts_rank(search, websearch_to_tsquery('english', 'jedi'))
-- value booster example
+ valueBooster(Genre, 'comedy', 0.05) rank
FROM movies
WHERE search @@ websearch_to_tsquery('english','jedi') ORDER BY rank DESC LIMIT 10;
title | genre | rank
--------------------------------------------------+------------------------------------+---------------------
The Men Who Stare at Goats | comedy | 0.1107927106320858
Clerks | comedy | 0.1107927106320858
Star Wars: The Clone Wars | animation | 0.09513916820287704
Star Wars: Episode I – The Phantom Menace 3D | sci-fi | 0.09471701085567474
Star Wars: Episode I – The Phantom Menace | space opera | 0.09471701085567474
Star Wars: Episode II – Attack of the Clones | science fiction | 0.09285612404346466
Star Wars: Episode III – Revenge of the Sith | science fiction, action | 0.09285612404346466
Star Wars: The Last Jedi | action, adventure, fantasy, sci-fi | 0.0889768898487091
Return of the Jedi | science fiction | 0.07599088549613953
Star Wars Episode IV: A New Hope (aka Star Wars) | science fiction | 0.07599088549613953
(10 rows)
Remember when we talked about the tsvector
lexemes and that they can have weights attached? Postgres supports 4 weights, named A, B, C, and D. A is the biggest weight while D is the lowest and the default. You can control the weights via the setweight
function which you would typically call when building the tsvector
column:
ALTER TABLE movies ADD search tsvector GENERATED ALWAYS AS
(setweight(to_tsvector('english', Title), 'A') || ' ' ||
to_tsvector('english', Plot) || ' ' ||
to_tsvector('simple', Director) || ' ' ||
to_tsvector('simple', Genre) || ' ' ||
to_tsvector('simple', Origin) || ' ' ||
to_tsvector('simple', Casting)
) STORED;
Let's see the effects of this. Without setweight
, a search for jedi
returns:
SELECT title, ts_rank(search, websearch_to_tsquery('english', 'jedi')) rank
FROM movies
WHERE search @@ websearch_to_tsquery('english','jedi')
ORDER BY rank DESC;
title | rank
--------------------------------------------------+-------------
Star Wars: The Clone Wars | 0.09513917
Star Wars: Episode I – The Phantom Menace | 0.09471701
Star Wars: Episode I – The Phantom Menace 3D | 0.09471701
Star Wars: Episode III – Revenge of the Sith | 0.092856124
Star Wars: Episode II – Attack of the Clones | 0.092856124
Star Wars: The Last Jedi | 0.08897689
Return of the Jedi | 0.075990885
Star Wars Episode IV: A New Hope (aka Star Wars) | 0.075990885
Clerks | 0.06079271
The Empire Strikes Back | 0.06079271
The Men Who Stare at Goats | 0.06079271
How to Deal | 0.06079271
(12 rows)
And with the setweight
on the title column:
SELECT title, ts_rank(search, websearch_to_tsquery('english', 'jedi')) rank
FROM movies
WHERE search @@ websearch_to_tsquery('english','jedi')
ORDER BY rank DESC;
title | rank
--------------------------------------------------+-------------
Star Wars: The Last Jedi | 0.6361112
Return of the Jedi | 0.6231253
Star Wars: The Clone Wars | 0.09513917
Star Wars: Episode I – The Phantom Menace | 0.09471701
Star Wars: Episode I – The Phantom Menace 3D | 0.09471701
Star Wars: Episode III – Revenge of the Sith | 0.092856124
Star Wars: Episode II – Attack of the Clones | 0.092856124
Star Wars Episode IV: A New Hope (aka Star Wars) | 0.075990885
The Empire Strikes Back | 0.06079271
Clerks | 0.06079271
The Men Who Stare at Goats | 0.06079271
How to Deal | 0.06079271
(12 rows)
Note how the movie titles with “jedi” in their name have jumped to the top of the list, and their rank has increased.
It's worth pointing out that having only four weight “classes” is somewhat limiting, and that they need to be applied when computing the tsvector
.
PostgreSQL doesn't support fuzzy search or typo-tolerance directly, when using tsvector
and tsquery
. However, working on the assumptions that the typo is in the query part, we can implement the following idea:
Here is how it works. First, use ts_stats
to get all words in a materialized view:
CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat('SELECT search FROM movies');
Now, for each word in the query, check if it is in the unique_lexeme
view. If it's not, do a fuzzy-search in that view to find possible misspellings of it:
SELECT * FROM unique_lexeme
WHERE levenshtein_less_equal(word, 'pregant', 2) < 2;
word
----------
premant
pregrant
pregnant
paegant
In the above we use the Levenshtein distance because that's what search engines like Elasticsearch use for fuzzy search.
Once you have the candidate list of words, you need to adjust the query include them all.
Faceted search is popular especially on e-commerce sites because it helps customers to iteratively narrow their search. Here is an example from amazon.com:
The above can implemented by manually defining categories and then adding them as WHERE
conditions to the search. Another approach is to create the categories algorithmically based on the existing data. For example, you can use the following to create a “Decade” facet:
SELECT ReleaseYear/10*10 decade, count(Title) cnt FROM movies
WHERE search @@ websearch_to_tsquery('english','star wars')
GROUP BY decade ORDER BY cnt DESC;
decade | cnt
--------+-----
2000 | 39
2010 | 31
1990 | 29
1950 | 28
1940 | 26
1980 | 22
1930 | 13
1960 | 11
1970 | 7
1910 | 3
1920 | 3
(11 rows)
This also provides counts of matches for each decade, which you can display in brackets.
If you want to get multiple facets in a single query, you can combine them, for example, by using CTEs:
WITH releaseYearFacets AS (
SELECT 'Decade' facet, (ReleaseYear/10*10)::text val, count(Title) cnt
FROM movies
WHERE search @@ websearch_to_tsquery('english','star wars')
GROUP BY val ORDER BY cnt DESC),
genreFacets AS (
SELECT 'Genre' facet, Genre val, count(Title) cnt FROM movies
WHERE search @@ websearch_to_tsquery('english','star wars')
GROUP BY val ORDER BY cnt DESC LIMIT 5)
SELECT * FROM releaseYearFacets UNION SELECT * FROM genreFacets;
facet | val | cnt
--------+---------+-----
Decade | 1910 | 3
Decade | 1920 | 3
Decade | 1930 | 13
Decade | 1940 | 26
Decade | 1950 | 28
Decade | 1960 | 11
Decade | 1970 | 7
Decade | 1980 | 22
Decade | 1990 | 29
Decade | 2000 | 39
Decade | 2010 | 31
Genre | comedy | 21
Genre | drama | 35
Genre | musical | 9
Genre | unknown | 13
Genre | war | 15
(16 rows)
The above should work quite well on small to medium data sets, however it can become slow on very large data sets.
We've seen the PostgreSQL full-text search primitives, and how we can combine them to create a pretty advanced full-text search engine, which also happens to support things like joins and ACID transactions. In other words, it has functionality that the other search engines typically don't have.
There are more advanced search topics that would be worth covering in detail:
Each of these would be worth their own blog post (coming!), but by now you should have an intuitive feeling about them: they are quite possible using PostgreSQL, but they require you to do the work of combining the primitives and in some cases the performance might suffer on very large datasets.
In part 2, we'll make a detailed comparison with Elasticsearch, looking to answer the question on when is it worth it to implement search into PostgreSQL versus adding Elasticsearch to your infrastructure and syncing the data. If you want to be notified when this gets published, you can follow us on Twitter or join our Discord.
Xata provides the best free plan in the industry. It is production ready by default and doesn't pause or cool-down. Take your time to build your business and upgrade when you're ready to scale.
Copyright © 2024 Xatabase Inc.
All rights reserved.