February 06, 2023
You are building an application using Spring JPA and want to implement a fast yet flexible full text search feature. Solutions like Hibernate Search might offer necessary tools at first sight, but come with limitations when you want to modify your search query with additional conditions. Check out this rather smooth alternative of combining PostgreSQL FTS with the JPA framework.
Let's quickly get onto the same page about our goal by drawing up a simple use case: A user (with a role that restricts their access rights to certain data) wants to search for documents containing specific text within an application. The user will probably do this by entering words into a search field and by selecting some other filter criteria provided by the UI, hoping to receive the matching data entries in response. In other words, a query needs to be performed that selects all the data containing the strings defined by the user from a textual point of view, while also restricting the selected data according to the conditions determined by the user role and the filter criteria.
PostgreSQL provides tools for FTS, that is, searching the database for data containing specific words and returning all matching entries in the order of their similarity to the query. While there are many ways to implement FTS with PostgreSQL, we chose the technique of matching vectors of processed data. The very basic concept of this can be broken down into the following elements:
to_tsvector()
: Input text (from our database) is parsed and converted to a tsvector that represents a searchable document.
to_tsquery()
: Input text (from the user) is parsed and converted to a tsquery that represents the search query.
@@
: The match operator compares a query vector against a document vector and returns true or false.
This is roughly how potential output data from our database is turned into a searchable vector:
Text data is tokenized, meaning that the string elements are broken up into tokens. Through the process of parsing, a type is assigned to each token. The tokens are then further processed according to their type: with reference to dictionaries, a token is identified e.g. as a stop word (which will not find its way into the vector) or as a lexeme, which then undergoes the process of stemming (reducing it to its linguistic word stem) before entering the vector. Stored data like
title |
---|
The Hitchhiker's Guide to the Galaxy |
... |
when processed by the function
SELECT to_tsvector('english', 'The Hitchhiker''s Guide to the Galaxy');
will be converted to a vector containing the reduced lexemes plus their positions in the original document:
'galaxi':7 'guid':4 'hitchhik':2
tsvector
is the data type which is used to store the parsed document.
The vector is (in most cases) built from the list of normalized lexemes, leaving out all the textual "noise" of the original document - which makes it much leaner and more compact, and thus quicker to query over.
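To get a feel for how much of the original text counts as noise, consider another quick example (the exact lexemes and positions may differ slightly depending on your dictionary version):
SELECT to_tsvector('english', 'So long, and thanks for all the fish!');
-- stop words ("so", "and", "for", "all", "the") are dropped, the rest is stemmed:
-- 'fish':8 'long':2 'thank':4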
Naturally, we need a parsed version of our user's search input too, so that it can be smoothly matched with the document vector.
The strings submitted by the user therefore also need to undergo the same process of eliminating the noise and stemming the lexemes, resulting in a tsquery
that combines the normalized search words with operators.
PostgreSQL offers different functions for converting user input to a tsquery
, which differ in the way they tokenize the input and set the operators.
Our go-to function is websearch_to_tsquery()
.
It basically treats user input the way a contemporary search engine would (which, we assume, has shaped users' expectations of how search input is entered).
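For illustration, here is a quick sketch of the conventions websearch_to_tsquery() understands (the exact output strings may vary slightly between PostgreSQL versions):
SELECT websearch_to_tsquery('english', 'hitchhiker or marvin');   -- "or" becomes the | operator
SELECT websearch_to_tsquery('english', 'hitchhiker -marvin');     -- "-" negates a term
SELECT websearch_to_tsquery('english', '"guide to the galaxy"');  -- quotes request a phrase match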
Imagine a user entering the search terms 'hitchhiker galaxy guide'
. In this case, the function
SELECT websearch_to_tsquery('english', 'hitchhiker galaxy guide');
will return this tsquery
:
'hitchhik' & 'galaxi' & 'guid'
When held against the tsvector
,
SELECT to_tsvector('english', 'The Hitchhiker''s Guide to the Galaxy') @@
websearch_to_tsquery('english', 'hitchhiker galaxy guide')
the search will successfully find a match (and return the boolean true
).
As for the parameters in the function, note that we have not only passed in the text, but also a language reference, which corresponds to a dictionary. (By the way: 'english' is also the default value.) But what if our database entries have different language origins? Stemming in English leads to different results than, for instance, stemming in German; by relying on the default, we would therefore risk "wrong" parsing and misleading results.
Let's look at the different results depending on the choice of language dictionary. A parsing process based on German, like
SELECT to_tsvector('german', 'Die Reise ins All');
will result in
'all':4 'reis':2
with "All" and "Reise" being lexemes in German ("die" and "ins" being article and preposition, hence stop words that undergo elimination), whereas an English-based parsing like
SELECT to_tsvector('english', 'Die Reise ins All');
will result in
'die':1 'in':3 'reis':2
For the latter example, where parsing was based on English stemming rules, you can see that "all" does not make it into the vector, since the English dictionary identifies it as a stop word and drops it.
A query like "reise & all"
where the user is looking for German titles containing space ("All") would consequently be unsuccessful. You hence want to keep your query code dynamic enough to choose the dictionary parameter according to the language of your data entries.
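Here is a minimal sketch of the mismatch, reusing the vectors from above:
-- Document parsed with the English configuration, query parsed with German:
-- no match, because 'all' never made it into the vector.
SELECT to_tsvector('english', 'Die Reise ins All') @@
       websearch_to_tsquery('german', 'Reise All');
-- Both sides parsed with the matching German configuration: match.
SELECT to_tsvector('german', 'Die Reise ins All') @@
       websearch_to_tsquery('german', 'Reise All');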
By adding a language column to our data tables, we can store information about the language of each entry.
This will allow us to later pass on the values in this column to the query function.
Pay attention that the data type is declared as REGCONFIG
so that the full text search functions can later accept the column value directly as the configuration parameter.
We are adding a language column to our table:
ALTER TABLE spacebooks
ADD IF NOT EXISTS lang REGCONFIG NOT NULL DEFAULT 'english';
If our table stores data from different languages, this can be referenced with a corresponding value in the language column for each row of data. (Existing rows should be updated accordingly; see the sketch after the table.)
author (text) | title (text) | lang (regconfig) |
---|---|---|
Douglas Adams | The Hitchhiker's Guide to the Galaxy | english |
Marvin Müller | Die Reise ins All | german |
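For existing rows, the language value can be set with an ordinary update. How you determine the language of each entry is up to your application; as a purely hypothetical one-off correction:
UPDATE spacebooks
SET lang = 'german'
WHERE title = 'Die Reise ins All';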
We will now replace the dictionary parameter with a variable and insert the values from the lang
column.
At this point we will add a generated search
column containing the vector data, covering all columns that should be searchable.
We will also define a hierarchy of the columns regarding their relevancy for search by assigning weights:
ALTER TABLE spacebooks
ADD IF NOT EXISTS search TSVECTOR
GENERATED ALWAYS AS (setweight(to_tsvector(lang, title::TEXT), 'A') ||
setweight(to_tsvector('simple', author::TEXT), 'B')
) STORED;
author (text) | title (text) | lang (regconfig) | search (tsvector) |
---|---|---|---|
Douglas Adams | The Hitchhiker's Guide to the Galaxy | english | 'adams':9B 'douglas':8B 'galaxi':7A 'guid':4A 'hitchhik':2A |
Marvin Müller | Die Reise ins All | german | 'all':4A 'marvin':5B 'müller':6B 'reis':2A |
Note that for the author
column, we have chosen the 'simple'
dictionary, which lower-cases its tokens but applies no stemming or other modification - this does justice to strings that should not undergo any chopping up, such as names.
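The difference becomes obvious when parsing a name with both configurations (the English output is a sketch and may vary with your dictionary version):
SELECT to_tsvector('english', 'Douglas Adams');  -- roughly 'adam':2 'dougla':1, the name gets mangled by stemming
SELECT to_tsvector('simple', 'Douglas Adams');   -- 'adams':2 'douglas':1, lower-cased but otherwise intact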
To speed up our search, we add an index to our search
column:
CREATE INDEX idx_search ON spacebooks USING GIN (search);
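To check that the index is actually used for a search, you can inspect the query plan (a quick sketch):
EXPLAIN ANALYZE
SELECT id, title
FROM spacebooks
WHERE search @@ websearch_to_tsquery('english', 'hitchhiker galaxy guide');
-- For a sufficiently large table, the plan should show a bitmap index scan on idx_search.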
Now we have established a handy base for querying our data using FTS.
Let us now refactor our query so that we can use all the benefits JPA offers us. Take a look at the following PostgreSQL query:
SELECT id,
author,
title,
ts_rank(search, websearch_to_tsquery(lang, 'hitchhiker guide')) +
ts_rank(search, websearch_to_tsquery('simple', 'hitchhiker guide')) AS rank
FROM spacebooks
WHERE search @@ websearch_to_tsquery(lang, 'hitchhiker guide')
OR search @@ websearch_to_tsquery('simple', 'hitchhiker guide')
ORDER BY rank DESC
We can decompose this query into reusable functions that we can then access from our backend.
First comes the actual matching function:
CREATE
OR REPLACE FUNCTION search_text(search TSVECTOR, lang REGCONFIG, term TEXT)
RETURNS BOOLEAN
AS
$$
SELECT search @@ websearch_to_tsquery(lang, term)
OR
search @@ websearch_to_tsquery('simple', term)
$$ LANGUAGE SQL;
We are defining a function search_text()
that takes the generated search vector, the language and the input term as its parameters
and performs the match described in the query above.
Next, let's define a function that returns a numeric value expressing the relevancy of a search result, so that in case of multiple matches, the results can be ranked accordingly.
CREATE
OR REPLACE FUNCTION search_rank(search TSVECTOR, lang REGCONFIG, term TEXT)
RETURNS NUMERIC
AS
$$
SELECT ts_rank(search, websearch_to_tsquery(lang, term)) +
ts_rank(search, websearch_to_tsquery('simple', term));
$$
LANGUAGE SQL;
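Before wiring in any access restrictions, we can sanity-check both helpers directly in SQL (a sketch using the sample data from above):
SELECT title,
       search_rank(search, lang, 'hitchhiker guide') AS rank
FROM spacebooks
WHERE search_text(search, lang, 'hitchhiker guide')
ORDER BY rank DESC;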
Let's get back to another specification in our use case and point out the relevant consequences:
A user has access to certain data only and should not be able to read from the entire database.
We want the search operation to take this into consideration by passing on the information about the user's access rights.
Our query function should take the necessary information as parameters.
The user also wants to limit the search results by applying a filter provided in the UI, like entries from the year 2022 only.
The specifics of this limitation should also be passed on to form the corresponding query.
To sum this up, the query should be constructed from not only the search terms but also additional information (such as tenant role, filter fields, ...).
We hence need to be able to build a query that combines the full text search feature with a table join across our data and additional WHERE
and AND
conditions:
SELECT spb.id, spb.title, search_rank(spb.search, spb.lang, 'hitchhiker guide') AS rank
FROM spacebooks spb LEFT JOIN "user" u ON spb.user_id = u.id
WHERE search_text(spb.search, spb.lang, 'hitchhiker guide')
AND u.name = 'Arthur'
ORDER BY rank desc;
We store these functions in our database and call them from our Java code. By using the @Query
annotation, Spring Data can include them in the query it executes against the database.
public interface SpaceBookRepository extends JpaRepository<SpaceBook, Long> { // assuming a Long primary key
    @Query("""
            SELECT spb FROM SpaceBook spb
            LEFT JOIN spb.user u
            WHERE search_text(spb.search, spb.lang, :search) = true
            AND u.name = :userName
            ORDER BY search_rank(spb.search, spb.lang, :search) DESC
            """)
    List<SpaceBook> search(@Param("search") String search, @Param("userName") String userName);
}
We add the following configuration for integration into the JPA framework:
spring:
  jpa:
    open-in-view: false
    hibernate:
      ddl-auto: validate
    properties:
      hibernate:
        metadata_builder_contributor: de.bytethefrog.labs.invoice.config.SqlFunctionsMetadataBuilderContributor
The MetadataBuilderContributor referenced by that path registers our custom SQL functions with Hibernate:
import org.hibernate.boot.MetadataBuilder;
import org.hibernate.boot.spi.MetadataBuilderContributor;
import org.hibernate.dialect.function.StandardSQLFunction;
import org.hibernate.type.StandardBasicTypes;

public class SqlFunctionsMetadataBuilderContributor implements MetadataBuilderContributor {
    @Override
    public void contribute(MetadataBuilder metadataBuilder) {
        // Register the PostgreSQL functions so they can be referenced in JPQL/HQL queries.
        metadataBuilder.applySqlFunction(
                "search_rank",
                new StandardSQLFunction("search_rank", StandardBasicTypes.DOUBLE)
        );
        metadataBuilder.applySqlFunction(
                "search_text",
                new StandardSQLFunction("search_text", StandardBasicTypes.BOOLEAN)
        );
    }
}
Voilà. Latest PostgreSQL for a smooth, fast & flexible FTS.
Note that websearch_to_tsquery()
does not allow for fuzzy search (a search input like 'hitchiker'
will not lead to finding documents containing 'hitchhiker'
).
To handle this limitation, one might consider first evaluating the input and, if need be, correcting it (e.g. by suggesting more likely search terms to the user) before passing it on to the function that creates the tsquery
.
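One possible approach for such suggestions - not part of the setup above, just a sketch based on the pg_trgm extension - is to compare the misspelled input against the lexemes that actually occur in our search vectors (the view and index names below are hypothetical):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Collect all lexemes that occur in the search column...
CREATE MATERIALIZED VIEW search_words AS
SELECT word FROM ts_stat('SELECT search FROM spacebooks');
CREATE INDEX idx_search_words ON search_words USING GIST (word gist_trgm_ops);
-- ...and look up the closest existing lexeme for a possibly misspelled term:
SELECT word
FROM search_words
ORDER BY word <-> 'hitchiker'
LIMIT 1;
The returned lexeme (or a word derived from it) could then be offered to the user as a "did you mean" suggestion before building the tsquery.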