Welcome to the third part of the “Image similarity search with pgvector” learning series! In the previous articles, you learned how to describe vector embeddings and vector similarity search. You also used the multi-modal embeddings APIs of Azure AI Vision for generating embeddings for a collection of images of paintings.
To find out more about previous posts, check out the links below:
In this learning series, we will create a search system that lets users provide a text description or a reference image to find similar paintings. We have already generated vector embeddings for the images in our dataset using the multi-modal embeddings API of Azure AI Vision. In this post, we will use Azure Blob Storage to store the images and Azure Cosmos DB for PostgreSQL to store our vector embeddings using the pgvector extension. In the upcoming tutorials, we will perform a similarity search on our embeddings. The workflow is illustrated in the following image:
In this tutorial, you will learn how to:
Upload images to an Azure Blob Storage container using the Python SDK.
Activate the pgvector extension on Azure Cosmos DB for PostgreSQL.
Store vector embeddings on an Azure Cosmos DB for PostgreSQL table.
Prerequisites
To proceed with this tutorial, ensure that you have the following prerequisites installed and configured:
In this guide, you’ll learn how to upload a collection of paintings’ images to an Azure Blob Storage container and insert vector embeddings into an Azure Cosmos DB for PostgreSQL table. The entire functional project is available in the GitHub repository. If you’re keen on trying it out, just fork the repository and clone it to have it locally available.
Before running the scripts covered in this post, you should:
Azure Blob Storage is a cloud storage service that is optimized for storing large amounts of unstructured data, such as images. It offers three types of resources:
The storage account that contains all your Azure Storage data objects. Every object that is stored in Azure Storage is identified by a unique address.
Containers in the storage account, which are similar to directories in a file system.
Blobs that are organized in the containers.
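To make the hierarchy concrete, here is a minimal sketch of how the three resource levels combine into the unique address of a blob. The account, container, and blob names are hypothetical placeholders, and the sketch assumes the default public Blob Storage endpoint (blob.core.windows.net) rather than a custom domain.

```python
from urllib.parse import quote


def blob_url(account_name: str, container_name: str, blob_name: str) -> str:
    """Compose a blob's address from the account, container, and blob names."""
    # Assumption: the account uses the default public endpoint.
    account_endpoint = f"https://{account_name}.blob.core.windows.net"
    # Percent-encode characters such as spaces in the blob name.
    return f"{account_endpoint}/{container_name}/{quote(blob_name)}"


# Hypothetical names, used only for illustration.
print(blob_url("mystorageaccount", "paintings", "starry night.jpg"))
# → https://mystorageaccount.blob.core.windows.net/paintings/starry%20night.jpg
```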
The following diagram illustrates the relationship between these resources:
Create an Azure Storage account
Open the Azure CLI.
Create an Azure Storage Account using the following command:
The Azure Blob Storage client library for Python provides the following classes to manage blobs and containers:
BlobServiceClient: We will use the BlobServiceClient class to interact with the Azure Storage account and create a container.
ContainerClient: We will use the ContainerClient class to interact with our container and the blobs inside the container.
BlobClient: We will use the BlobClient class to upload a blob to our container.
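The sketch below shows one way these three classes could work together. It is not the project's actual script: the helper names, the container name, the file path, and the AZURE_STORAGE_CONNECTION_STRING environment variable are assumptions made for illustration. The helpers only rely on the client methods get_container_client, exists, create_container, get_blob_client, and upload_blob.

```python
import os


def get_or_create_container(blob_service_client, container_name):
    """Return a ContainerClient, creating the container if it does not exist."""
    container_client = blob_service_client.get_container_client(container_name)
    if not container_client.exists():
        container_client.create_container()
    return container_client


def upload_image(container_client, image_path):
    """Upload one local file as a blob named after the file."""
    blob_name = os.path.basename(image_path)
    blob_client = container_client.get_blob_client(blob_name)
    with open(image_path, "rb") as data:
        # overwrite=True replaces a blob with the same name if it exists.
        blob_client.upload_blob(data, overwrite=True)
    return blob_name


def example_usage():
    """Illustrative only; requires the azure-storage-blob package."""
    from azure.storage.blob import BlobServiceClient

    # Assumption: the connection string is stored in this environment variable.
    connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    service = BlobServiceClient.from_connection_string(connection_string)
    container = get_or_create_container(service, "paintings")  # hypothetical name
    upload_image(container, "dataset/example.jpg")  # hypothetical path
```

Passing the clients in as arguments keeps the helpers easy to try out with stand-in objects before pointing them at a real storage account.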
The process of uploading our images to Azure Blob Storage can be summarized as follows:
Create a new container to store our images.
Retrieve the filenames of the images in the dataset.
Upload the images to the container, using multiple threads via the ThreadPoolExecutor class. Additionally, use the tqdm library to display a progress bar that tracks the upload.
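The last two steps can be sketched roughly as follows. This is a simplified illustration rather than the repository's script: list_image_files, upload_all, the file extensions, and the worker count are assumptions, and upload_one stands for any function that uploads a single file (for example, a wrapper around BlobClient.upload_blob). If tqdm is not installed, the sketch falls back to running without a progress bar.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm (no progress bar).
    def tqdm(iterable, **kwargs):
        return iterable


def list_image_files(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return the sorted paths of the image files found directly in a folder."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(extensions)
    )


def upload_all(image_paths, upload_one, max_workers=8):
    """Run upload_one(path) on several threads; return (path, error) for failures."""
    failed = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(upload_one, path): path for path in image_paths}
        for future in tqdm(as_completed(futures), total=len(futures), desc="Uploading"):
            try:
                future.result()
            except Exception as exc:
                # Record the failure and keep going with the remaining files.
                failed.append((futures[future], exc))
    return failed
```

Threads suit this task because uploads spend most of their time waiting on the network. Collecting failures instead of stopping at the first error makes it possible to retry only the files that did not upload.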
Let’s use the Azure portal to create an Azure Cosmos DB for PostgreSQL cluster.
Search for “Azure Cosmos DB for PostgreSQL” and then select +Create.
Fill out the information on the Basics tab:
Subscription: Select your subscription.
Resource group: Select your resource group.
Cluster name: Enter a name for your Azure Cosmos DB for PostgreSQL cluster.
Location: Choose your preferred region.
Scale: You can leave Scale as its default value or select the optimal number of nodes as well as compute, memory, and storage configuration. Burstable, 1 vCores / 2 GiB RAM, 32 GiB storage is sufficient for this demo.
PostgreSQL version: Choose a PostgreSQL version such as 15.
Database name: You can leave database name at its default value citus.
Administrator account: The admin username must be citus. Choose a password that the citus role will use to connect to the database.
On the Networking tab, select Allow public access from Azure services and resources within Azure to this cluster and create your preferred firewall rule.
Select Review + Create. Once the deployment is complete, navigate to your resource.
Activate the pgvector extension
The pgvector extension adds vector similarity search capabilities to your PostgreSQL database. To use the extension, you first have to create it in your database. You can enable it by connecting to your database and running the following CREATE_EXTENSION command from the psql command prompt:
SELECT CREATE_EXTENSION('vector');
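If you prefer to stay in Python, the same statement can be sent through a database cursor. The sketch below is an illustration under assumptions: enable_pgvector and example_usage are hypothetical helpers, and the cursor is expected to behave like a psycopg2 cursor. After enabling the extension, it reads the installed version from the pg_extension catalog as a sanity check.

```python
def enable_pgvector(cursor):
    """Enable pgvector and return its installed version, or None if not found."""
    # Same statement as the psql command shown above.
    cursor.execute("SELECT CREATE_EXTENSION('vector');")
    # pg_extension lists the extensions installed in the current database.
    cursor.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector';")
    row = cursor.fetchone()
    return row[0] if row else None


def example_usage(connection_string):
    """Illustrative only; requires psycopg2 and a reachable cluster."""
    import psycopg2

    conn = psycopg2.connect(connection_string)
    try:
        with conn.cursor() as cursor:
            print("pgvector version:", enable_pgvector(cursor))
        conn.commit()
    finally:
        conn.close()
```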
The pgvector extension introduces a data type called VECTOR that can be used during the creation of a table to indicate that a column will hold vector embeddings. When creating the column, it’s essential to specify the dimension of the vectors. In our scenario, Azure AI Vision generates 1024-dimensional vectors.
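pgvector accepts vectors in a bracketed text form such as [0.1,0.2,0.3], and a VECTOR(1024) column rejects values with a different number of dimensions. The following sketch shows a small helper, with the hypothetical name to_pgvector_literal, that checks the dimension of an embedding and serializes it to that text form before it is written to a CSV file or passed to a query.

```python
EMBEDDING_DIMENSION = 1024  # dimension of the Azure AI Vision embeddings


def to_pgvector_literal(embedding, dimension=EMBEDDING_DIMENSION):
    """Serialize a sequence of numbers to pgvector's text format, e.g. [1.0,2.0]."""
    if len(embedding) != dimension:
        raise ValueError(
            f"Expected {dimension} dimensions, got {len(embedding)}"
        )
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# A 3-dimensional toy example (real embeddings have 1024 values).
print(to_pgvector_literal([1, 0.5, -2], dimension=3))
# → [1.0,0.5,-2.0]
```

Checking the dimension on the client side gives a clearer error message than a failed insert halfway through a bulk load.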
Insert data into a PostgreSQL table
To insert data into an Azure Cosmos DB for PostgreSQL table, we will proceed as follows:
Create a table to store the filenames of the images, their embeddings, and their associated metadata. All information is saved in a CSV file, as presented in Part 2.
Insert the data from the CSV file into the table using the PostgreSQL COPY command.
import os

import psycopg2
from psycopg2 import pool
from dotenv import load_dotenv

# Constants
IMAGE_FILE_COLUMN_NAME = "image_file"
DESCRIPTION_COLUMN_NAME = "description"
AUTHOR_COLUMN_NAME = "author"
TITLE_COLUMN_NAME = "title"
TECHNIQUE_COLUMN_NAME = "technique"
TYPE_COLUMN_NAME = "type"
TIMEFRAME_COLUMN_NAME = "timeframe"
VECTOR_COLUMN_NAME = "vector"

# Directories
current_dir = os.path.dirname(os.path.realpath(__file__))
parent_dir = os.path.dirname(current_dir)

# Load environment file
load_dotenv(os.path.join(parent_dir, ".env"), override=True)

# Azure Cosmos DB for PostgreSQL credentials
postgres_host = os.getenv("POSTGRES_HOST")
postgres_database_name = os.getenv("POSTGRES_DB_NAME")
postgres_user = os.getenv("POSTGRES_USER")
postgres_password = os.getenv("POSTGRES_PASSWORD")
sslmode = "require"
table_name = os.getenv("POSTGRES_TABLE_NAME")
postgres_connection_string = (
    f"host={postgres_host} user={postgres_user} "
    f"dbname={postgres_database_name} "
    f"password={postgres_password} sslmode={sslmode}"
)

# Dataset's folder
dataset_folder = os.path.join(parent_dir, "dataset")
dataset_filepath = os.path.join(dataset_folder, "dataset_embeddings.csv")


def main():
    postgresql_pool = psycopg2.pool.SimpleConnectionPool(1, 20, postgres_connection_string)
    if postgresql_pool:
        print("Connection pool created successfully")

    # Get a connection from the connection pool
    conn = postgresql_pool.getconn()
    cursor = conn.cursor()

    # Recreate the table so that the script can be run repeatedly
    print("Creating a table...")
    cursor.execute(f"DROP TABLE IF EXISTS {table_name};")
    cursor.execute(
        f"CREATE TABLE {table_name} ("
        f"{IMAGE_FILE_COLUMN_NAME} TEXT PRIMARY KEY,"
        f"{DESCRIPTION_COLUMN_NAME} TEXT NOT NULL,"
        f"{AUTHOR_COLUMN_NAME} TEXT NOT NULL,"
        f"{TITLE_COLUMN_NAME} TEXT NOT NULL,"
        f"{TECHNIQUE_COLUMN_NAME} TEXT,"
        f"{TYPE_COLUMN_NAME} TEXT,"
        f"{TIMEFRAME_COLUMN_NAME} TEXT,"
        f"{VECTOR_COLUMN_NAME} VECTOR(1024) NOT NULL);"
    )

    # Bulk-load the tab-delimited CSV file; HEADER MATCH verifies that the
    # header row matches the table's column names
    print("Saving data to table...")
    with open(dataset_filepath) as csv_file:
        cursor.copy_expert(
            f"COPY {table_name} FROM STDIN WITH "
            f"(FORMAT csv, DELIMITER '\t', HEADER MATCH);",
            csv_file
        )
    conn.commit()

    # Fetch all rows from table
    cursor.execute(f"SELECT * FROM {table_name};")
    rows = cursor.fetchall()
    print(f"Number of records in the table: {len(rows)}")

    # Close the connection
    cursor.close()
    conn.close()


if __name__ == "__main__":
    main()
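Because the COPY command above uses the HEADER MATCH option, PostgreSQL rejects the file when the header row does not list the table's columns by name and in order. A quick local check before contacting the database can surface this early. The sketch below is an optional addition, not part of the original script; header_matches is a hypothetical helper, and the expected column order is taken from the CREATE TABLE statement above.

```python
import csv

# Column order taken from the CREATE TABLE statement in the script above.
EXPECTED_COLUMNS = [
    "image_file", "description", "author", "title",
    "technique", "type", "timeframe", "vector",
]


def header_matches(csv_path, expected_columns=EXPECTED_COLUMNS, delimiter="\t"):
    """Return True if the file's first row equals the expected column names."""
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        # An empty file yields an empty header, which never matches.
        header = next(csv.reader(csv_file, delimiter=delimiter), [])
    return header == list(expected_columns)
```

Calling header_matches(dataset_filepath) before copy_expert would let the script stop with a clear message instead of failing in the middle of the load.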
Next steps
In this post, you uploaded the paintings’ images into an Azure Blob Storage container, configured the Azure Cosmos DB for PostgreSQL database as a vector database using the pgvector extension, and inserted the data into a table. In the subsequent posts, you will leverage the pgvector extension to perform a similarity search on the embeddings.
If you want to explore pgvector’s features, check out these learning resources:
I am an Electrical and Computer Engineering student at Aristotle University of Thessaloniki, Greece, interested in AI, IoT, Wireless Communications, and Cloud Technologies, as well as in applications of technology in healthcare and education. I am also a Microsoft AI MVP and a Gold Microsoft Learn Student Ambassador.