Use pgvector for searching images on Azure Cosmos DB for PostgreSQL

Welcome to the next part of the “Image similarity search with pgvector” learning series! In the previous articles, you used the multi-modal embeddings APIs of Azure AI Vision for generating embeddings for a collection of images of paintings and stored the embeddings in an Azure Cosmos DB for PostgreSQL table.

To find out more about previous posts, check out the links below:

Introduction

If you have followed the previous posts, you should have successfully created a table in your Azure Cosmos DB for PostgreSQL cluster, populated it with data, and uploaded the images to a container in your Azure Storage account. Now, you are fully prepared to search for similar images utilizing the vector similarity search features of the pgvector extension.

In this tutorial, you will:

Develop an image similarity search app using Jupyter Notebook.
Write SQL queries to detect similar images based on a text prompt or a reference image.
Apply a simple metadata filtering method to narrow down search results.

Prerequisites

To proceed with this tutorial, ensure that you have the following prerequisites installed and configured:

An Azure subscription - Create an Azure free account or an Azure for Students account.
Python 3.10, Visual Studio Code, Jupyter Notebook, and Jupyter Extension for Visual Studio Code.

Set-up your working environment

In this guide, you’ll learn how to query embeddings stored in an Azure Cosmos DB for PostgreSQL table to search for images similar to a search term or a reference image. The entire functional project is available in the GitHub repository. If you’re keen on trying it out, just fork the repository and clone it to have it locally available.

Before running the Jupyter Notebook covered in this post, you should:

Create a virtual environment and activate it.
Install the required Python packages using the following command:
1
pip install -r requirements.txt
Create vector embeddings for a collection of images by running the scripts found in the data_processing directory.
Upload the images to your Azure Blob Storage container, create a PostgreSQL table, and populate it with data by executing the scripts found in the data_upload directory.

Detect similar images using the pgvector extension

The code for image similarity search with the pgvector extension can be found at vector_search_samples/image_search.ipynb.

The image similarity search workflow that we will follow is summarized as follows:

Use the Azure AI Vision Vectorize Image API or the Vectorize Text API to generate the vector embedding of a reference image or text prompt, respectively. It is crucial to employ the same embedding model for queries as the one used to generate embeddings for the images in the dataset.
To calculate similarity and retrieve images, utilize SQL SELECT statements and the built-in vector operators of the PostgreSQL database. Specifically, cosine similarity will be used as the similarity metric.
The similarity search will produce a list of vectors that are most similar to the query vector. The raw data associated with each vector can then be accessed.
Download the images from the Azure Blob Storage container and display them using the matplotlib package.

This workflow is illustrated in the following diagram:

Nearest neighbor search using pgvector

Given the vector embedding of the query, we can use SQL SELECT statements to search for similar images. Let’s understand how a simple SELECT statement works. Consider the following query:

1
SELECT * FROM table ORDER BY vector <=> '[0.003, …, 0.034]' LIMIT 5

This query computes the cosine distance (<=>) between the given vector ([0.003, …, 0.034]) and the vectors stored in the table, sorts the results by the calculated distance, and returns the five most similar images (LIMIT 5). Additionally, you can obtain the cosine similarity between the query vector and the retrieved vectors by modifying the SELECT statement as follows:

1
SELECT image_title, 1 - (vector <=> '[0.003, …, 0.034]') AS cosine_similarity FROM table ORDER BY cosine_similarity DESC LIMIT 5;

The pgvector extension provides 3 operators that can be used to calculate similarity:

Operator	Description
`<->`	Euclidean distance
`<#>`	Negative inner product
`<=>`	Cosine distance

Code sample: Image similarity search

In the Jupyter Notebook provided on my GitHub repository, you’ll explore the following scenarios:

Text-to-image search: You will use a text prompt to search for and identify paintings that are semantically similar, relying solely on the vector embeddings without utilizing image metadata, such as the title or description of the painting.
Image-to-image search: You will use a painting as a reference to search for similar ones by comparing the vector embedding of the reference image with those in the collection.
Metadata filtering: Filtering enables users to narrow down search results, such as searching for paintings by a specific artist. However, implementing accurate and fast metadata filtering in vector search systems is a challenging task. You can read the article The Missing WHERE Clause in Vector Search on the Pinecone blog to learn about the two fundamental approaches for metadata filtering and understand the complexities involved in implementing such filters into vector search applications.

Twelve paintings depicting flowers by Vincent van Gogh. These artworks were retrieved through a search using the text prompt 'flowers by Vincent van Gogh'. — Images retrieved by searching for paintings using the text prompt “flowers by Vincent van Gogh”.

Feel free to experiment with the notebook and modify the code to gain hands-on experience with the pgvector extension!

Next steps

In this post, you explored the basic vector similarity search features offered by the pgvector extension. This type of vector search is referred to as exact nearest neighbor search, as it computes the similarity between the query vector and every vector in the database. In the upcoming post, you will explore approximate nearest neighbor search, which trades off result quality for speed.

If you want to explore pgvector’s features, check out these learning resources: