PostgreSQL Full-text search using “text search vectors”
Postgres Vector Search
Full-text search in PostgreSQL is implemented using a concept called “text search vectors” (or just “tsvector”). Let’s dive into how it works:
- Text Search Vectors (tsvector):
- A tsvector is a sorted list of distinct lexemes, which are words that have been normalized to merge different forms of the same word (e.g., “run” and “running”).
- PostgreSQL provides functions to convert plain text into tsvector format, which typically involves:
- Parsing the text into tokens.
- Converting tokens to lexemes.
- Removing stop words (common words like “and” or “the” that are typically ignored in searches).
- Example: The text “The quick brown fox” might be represented in tsvector as ‘brown’:3 ‘fox’:4 ‘quick’:2.
- Text Search Queries (tsquery):
- A tsquery represents a text search query, which includes lexemes and optional operators.
- Operators can be used to combine lexemes in different ways (e.g., AND, OR, NOT).
- Example: The query “quick & fox” would match any tsvector containing both “quick” and “fox”.
- Searching:
- PostgreSQL provides the @@ operator to search a tsvector column with a tsquery.
- Example: WHERE column @@ to_tsquery(‘english’, ‘quick & fox’).
- Ranking:
- Once you’ve found matches using the @@ operator, you often want to rank them by relevance.
- PostgreSQL provides the ts_rank function to rank results. It returns a number indicating how relevant a tsvector is to a tsquery.
- The ranking is based on various factors, including the frequency of lexemes and their proximity to each other in the text.
- Indexes:
- One of the significant advantages of tsvector is that you can create a GiST or GIN index on it.
- These indexes significantly speed up full-text search queries.
- GIN indexes, in particular, are optimized for tsvector and provide very fast lookups.
- Normalization and Configuration:
- PostgreSQL supports multiple configurations (e.g., “english”, “french”) that determine how text is tokenized and which stop words are used.
- This allows you to tailor your full-text search to specific languages or requirements.
- Highlighting and Snippets:
- In addition to just searching, PostgreSQL provides functions like ts_headline to return snippets of the original text with search terms highlighted.
In summary, PostgreSQL’s full-text search works by converting regular text into a normalized format (tsvector) that is optimized for searching. This combined with powerful query capabilities (tsquery) and indexing options makes it a robust solution for many full-text search needs.
Implementing vector search using E.F Core and Postgres SQL
here are the steps to implement vector search in your dot net project:
Step 1: Add the required nuget packages
<PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL" Version="7.0.11" /> <PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL.Design" Version="1.1.0" /> <PackageReference Include="Npgsql.EntityFrameworkCore.PostgreSQL.NetTopologySuite" Version="7.0.11" />
Step 2: Implement a vector in your entities by implementing properties of type NpgsqlTsVector as shown below
public class Blog { public int Id { get; set; } public string Title { get; set; } public NpgsqlTsVector SearchVector { get; set; } }
Step 3: add a computed column in your DbContext
protected override void OnModelCreating(ModelBuilder modelBuilder) { modelBuilder.Entity<Blog>() .Property(b => b.SearchVector) .HasComputedColumnSql("to_tsvector('english', \"Blogs\".\"Title\")", stored: true); }
in this case you are calculating the vector using the value of the title column of the blogs table, you can calculate the vector using a single column or a combination of columns
Now you are ready to use vector search in your queries, please check the example below
var searchTerm = "Jungle"; // Example search term var searchVector = NpgsqlTsVector.Parse(searchTerm); var blogs = context.Blogs .Where(p => p.SearchVector.Matches(searchTerm)) .OrderByDescending(td => td.SearchVector.Rank(EF.Functions.ToTsQuery(searchTerm))).ToList();
In real world scenarios its better to create a vector by joining the values of several columns and weight them according to the relevance for your business case, you can check the test project I have created here : https://github.com/egarim/PostgresVectorSearch
and that’s it for this post, until next time, happy coding ))