Use HNSW index on Azure Cosmos DB for PostgreSQL for similarity search

In the previous post, you explored the IVFFlat (Inverted File with Flat Compression) index for approximate nearest neighbor search on Azure Cosmos DB for PostgreSQL. You observed that the IVFFlat index provides accurate results with lower search times compared to exact nearest neighbor search. To find out more about previous posts, check out the links below:

The pgvector extension provides another indexing algorithm for approximate nearest neighbor search called Hierarchical Navigable Small World (HNSW) graphs. HNSW is one of the most popular and best-performing indexes for vector similarity search. HNSW index support was introduced in pgvector 0.5.0.

In this tutorial, you will:

Create an HNSW index in your Azure Cosmos DB for PostgreSQL table.
Write SQL queries to detect similar images based on a text prompt or a reference image, utilizing the HNSW index.
Investigate the execution plan of a similarity search query.

Prerequisites

To proceed with this tutorial, ensure that you have the following prerequisites installed and configured:

An Azure subscription - Create an Azure free account or an Azure for Students account.
Python 3.10, Visual Studio Code, Jupyter Notebook, and Jupyter Extension for Visual Studio Code.

Set-up your working environment

In this guide, you’ll learn how to query embeddings stored in an Azure Cosmos DB for PostgreSQL table to search for images similar to a search term or a reference image. The entire functional project is available in my GitHub repository. If want to follow along, just fork the repository and clone it to have it locally available.

Before running the Jupyter Notebook covered in this post, you should:

Create a virtual environment and activate it.
Install the required Python packages using the following command:
1
pip install -r requirements.txt
Create vector embeddings for a collection of images by running the scripts found in the data_processing directory.
Upload the images to your Azure Blob Storage container by executing the script found in the data_upload directory.

How the HNSW index works

The HNSW index is based on the construction of a multi-layered graph structure, that is optimized for performing approximate nearest neighbor search. In this graph structure, the datapoints (also referred to as nodes or vertices) are connected to each other by edges, which make it possible to navigate through the graph by following these edges.

The base layer of the multi-layer graph essentially represents the entire dataset, while the higher layers consist of fewer nodes, providing a simplified overview of the layers below. The higher layers contain longer links, allowing for longer jumps between nodes for faster search, while the lower layers contain shorter links, enabling more accurate search. Nearest neighbor search begins at the top layer, where the longest links are present. We then navigate to the nearest node and gradually move to lower layers until a local minimum is found. This process is illustrated in the following image:

The search process in an HNSW graph can be compared to the process of planning a trip between two cities. Much like how we start our journey with major roads and gradually transition to smaller ones as we approach our destination, the HNSW search process begins with longer links at the top layer and gradually moves to lower layers as we approach the desired data points.

The HNSW algorithm is based on two fundamentals techniques: the probability skip list and the navigable small world (NSW) graphs. A detailed explanation of the process of constructing an index and searching through the graph is beyond the scope of this article. For further information, refer to the resources provided at the end of this article.

Compared to the IVFFlat index, the HNSW index generally provides better query performance in terms of the tradeoff between recall and speed, but at the expense of higher build time and more memory usage. Additionally, it doesn’t require a training step to build the index. This means you can create an HNSW index even before any data is inserted into the table, unlike the IVFFlat index, which needs to be rebuilt when data changes to accurately represent new cluster centroids.

Create an HNSW index

The code for creating an HNSW index and inserting data into a PostgreSQL table can be found at data_upload/upload_data_to_postgresql_hnsw.py.

To create an HNSW index through the pgvector extension, three parameters need to be specified:

Distance: The pgvector extension provides 3 methods for calculating the distance between vectors: Euclidean (L2), inner product, and cosine. These methods are identified by vector_l2_ops, vector_ip_ops, and vector_cosine_ops, respectively. It is essential to select the same distance metric for both the creation and querying of the index.
m: The parameter m specifies the maximum number of connections with neighboring datapoints per point per layer. Its default value is 16.
ef_construction: The parameter ef_construction defines the size of list that holds the nearest neighbor candidates when building the index. The default value is 64.

To create an HNSW index in a PostgreSQL table, you can use the following statement:

1
CREATE INDEX ON <table_name> USING hnsw (<vector_column_name> <distance_method>) WITH (m = <m>, ef_construction = <ef_construction>);

Detect similar images using the pgvector HNSW index

The code for image similarity search with the pgvector extension can be found at vector_search_samples/image_search_hnsw_index.ipynb.

To search for similar images through the HNSW index of the pgvector extension, we can use SQL SELECT statements and the built-in distance operators. The structure of a SELECT statement was explained in the Exact Nearest Neighbor Search blog post. For approximate nearest neighbor search, an additional parameter needs to be considered to use the HNSW index.

Approximate nearest neighbor search using the HNSW index

The ef_search parameter specifies the size of the list that holds the nearest neighbor candidates during query execution. The default value is set to 40. The ef_search parameter can be altered (for a single query or a session) using the following command:

1
SET hnsw.ef_search = <value>;

To check whether PostgreSQL utilizes the index in a query, you can prefix the SELECT statement with the EXPLAIN ANALYZE keywords. An example of a query plan that utilizes the HNSW index is provided below:

1
2
3
4
5
Limit (cost=160.60..163.02 rows=12 width=72) (actual time=1.283..1.406 rows=12 loops=1)
  ->  Index Scan using paintings_hnsw_vector_idx on paintings (cost=160.60..2416.67 rows=11206 width=72) (actual time=1.281..1.403 rows=12 loops=1)
        Order By: (vector <=> '[0.001363333, ..., -0.0010466448]'::vector)
Planning Time: 0.183 ms
Execution Time: 1.439 ms

Additionally, it is important to note that pgvector only supports ascending-order index scans. This means that the following query does not utilize the HNSW index:

1
SELECT image_title, artist_name, 1 - (vector <=> '[0.003, …, 0.034]') AS cosine_similarity FROM paintings ORDER BY cosine_similarity DESC LIMIT 12;

One possible way to rewrite the SELECT statement to use the index is provided below:

1
SELECT image_title, artist_name, vector <=> '[0.003, …, 0.034]' AS cosine_distance FROM paintings ORDER BY cosine_distance LIMIT 12;

Code sample: Image similarity search with HNSW index

In the Jupyter Notebook provided on my GitHub repository, you’ll explore text-to-image and image-to-image search scenarios. You will use the same text prompts and reference images as in the Exact Nearest Neighbors search example, allowing for a comparison of the accuracy of the results.

Images retrieved by searching for paintings using the painting "Still Life with Flowers" by Charles Ginner as a reference. The HNSW index retrieved all the paintings obtained with exact search. — Images retrieved by searching for paintings using the painting “Still Life with Flowers” by Charles Ginner as a reference. The HNSW index retrieved all the paintings obtained with exact search.

Feel free to experiment with the notebook and modify the code to gain hands-on experience with the pgvector extension!

Next steps

The pgvector extension and PostgreSQL provide additional features that you can leverage to build AI-powered search applications. For example, you can integrate vector search with conventional keyword-based search methods into hybrid search systems, which generally have better performance.

If you want to learn more about the HNSW algorithm, check out these learning resources: