Use the Azure AI Vision multi-modal embeddings API for image retrieval

Welcome to a new learning series about image similarity search with pgvector, an open-source vector similarity search extension for PostgreSQL databases. I find vector search an intriguing technology, and I’ve decided to explore it! Throughout this series, I will discuss the basic concepts of vector search, introduce you to the multi-modal embeddings API of Azure AI Vision, and guide you in building an image similarity search application using Azure Cosmos DB for PostgreSQL.

Introduction

Conventional search systems rely on exact matches on properties like keywords, tags, or other metadata, lexical similarity, or the frequency of word occurrences to retrieve similar items. Recently, vector similarity search has transformed the search process. It leverages machine learning to capture the meaning of data, allowing you to find similar items based on their content. The key idea behind vector search involves converting unstructured data, such as text, images, videos, and audio, into high-dimensional vectors (also known as embeddings) and applying nearest neighbor algorithms to find similar data.

In this tutorial, you learn how to:

  • Describe vector embeddings and vector similarity search.
  • Use the multi-modal embeddings API of Azure AI Vision for generating vectors for images and text.

Prerequisites

To proceed with this tutorial, ensure that you have the following prerequisites installed and configured:

  1. An Azure subscription - Create an Azure free account or an Azure for Students account.
  2. Python 3.x, Visual Studio Code, Jupyter Notebook, and Jupyter Extension for Visual Studio Code.

Concepts

Vector embeddings

Unlike numerical and structured data, which can be compared directly with mathematical operations, unstructured data is challenging to compare. What if we could convert unstructured data, such as text and images, into a numerical representation? We could then calculate their similarity using standard mathematical methods.

These numerical representations are called vector embeddings. An embedding is a high-dimensional and dense vector that summarizes the information contained in the original data. Vector embeddings can be computed using machine learning algorithms that capture the meaning of the data, recognize patterns, and identify similarities between the data.

Visualization of word embeddings in a 2-dimensional vector space. Words that are semantically similar are located close together, while dissimilar words are placed farther apart.

Vector similarity

The numerical distance between two embeddings, or equivalently, their proximity in the vector space, represents their similarity. Vector similarity is commonly calculated using distance metrics such as Euclidean distance, inner product, or cosine distance.

Cosine similarity is the metric used by Azure AI Vision. This metric measures the angle between two vectors and is not affected by their magnitudes. Mathematically, cosine similarity is defined as the cosine of the angle between two vectors, which is equal to the dot product of the vectors divided by the product of their magnitudes. For example, the vectors (1, 2) and (2, 4) point in the same direction and have a cosine similarity of 1, while the perpendicular vectors (1, 2) and (2, -1) have a cosine similarity of 0.

Vector similarity can be used in various industry applications, including recommender systems, fraud detection, text classification, and image recognition. For example, systems can use vector similarities between products to identify similar products and create recommendations based on a user’s preferences.

A vector search system works by comparing the vector embedding of a user’s query with a set of pre-stored vector embeddings to find a list of vectors that are the most similar to the query vector. The diagram below illustrates this workflow.

Overview of vector similarity search flow.

Vector embeddings are usually stored in a vector database, which is a specialized type of database that is optimized for storing and querying vectors with a large number of dimensions. You will learn more about vector databases in one of the following posts in this series.
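
To make the comparison step concrete, here is a minimal, self-contained sketch that ranks a small in-memory collection of vectors by cosine similarity to a query vector. The labels and vector values are made-up toy data; a real system would query a vector database instead:

import numpy as np

def top_k_similar(query, items, k=2):
    # Rank stored (label, vector) pairs by cosine similarity to the query vector
    scores = []
    for label, vec in items:
        sim = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores.append((label, sim))
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings" with made-up values
stored = [
    ("blue house", [0.9, 0.1, 0.2]),
    ("red car", [0.1, 0.8, 0.3]),
    ("blue car", [0.5, 0.7, 0.2]),
]
query_vector = [0.8, 0.2, 0.1]

print(top_k_similar(query_vector, stored))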

Create vector embeddings with Azure AI Vision

Azure AI Vision provides two APIs for vectorizing image and text queries: the Vectorize Image API and the Vectorize Text API. This vectorization converts images and text into coordinates in a 1024-dimensional vector space, enabling users to search a collection of images using text and/or images without the need for metadata, such as image tags, labels, or captions.

Let’s learn how the multi-modal embeddings APIs work.

Create an Azure AI Vision resource

  1. Open the Azure CLI.

  2. Create a resource group using the following command:

    az group create --name your-group-name --location your-location
    
  3. Create an Azure AI Vision resource in the resource group that you created using the following command:

    az cognitiveservices account create --name ai-vision-resource-name --resource-group your-group-name --kind ComputerVision --sku S1 --location your-location --yes
    
The multi-modal embeddings APIs are available in the following regions: East US, France Central, Korea Central, North Europe, Southeast Asia, West Europe, West US.

Before using the multi-modal embeddings APIs, you need to store the key and the endpoint of your Azure AI Vision resource in an environment (.env) file.
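
If you created the resource with the Azure CLI, you can look up both values with commands like the following, reusing the placeholder names from the previous step:

     az cognitiveservices account show --name ai-vision-resource-name --resource-group your-group-name --query properties.endpoint
     az cognitiveservices account keys list --name ai-vision-resource-name --resource-group your-group-name

Your .env file should then define the two variables read by the code below; the values shown here are placeholders. Note that the code appends computervision/ to the endpoint, so make sure the endpoint value ends with a slash.

     VISION_ENDPOINT=<endpoint-from-the-first-command>
     VISION_KEY=<key-from-the-second-command>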

Use the Vectorize Image API

Let’s review the following example. Given the filename of an image, the get_image_embedding function sends a POST request to the retrieval:vectorizeImage API, with the binary image data included in the HTTP request body. The API returns a JSON object containing the vector embedding of the image.

import os
from dotenv import load_dotenv
import requests

# Load environment variables
load_dotenv()
endpoint = os.getenv("VISION_ENDPOINT") + "computervision/"
key = os.getenv("VISION_KEY")

def get_image_embedding(image):
    with open(image, "rb") as img:
        data = img.read()

    # Vectorize Image API
    version = "?api-version=2023-02-01-preview&modelVersion=latest"
    vectorize_img_url = endpoint + "retrieval:vectorizeImage" + version

    headers = {
        "Content-type": "application/octet-stream",
        "Ocp-Apim-Subscription-Key": key
    }

    try:
        r = requests.post(vectorize_img_url, data=data, headers=headers)

        if r.status_code == 200:
            image_vector = r.json()["vector"]
            return image_vector
        else:
            print(f"An error occurred while processing {image}. Error code: {r.status_code}.")
        
    except Exception as e:
        print(f"An error occurred while processing {image}: {e}")

    return None

image_filename = "images/image (1).jpg"
image_vector = get_image_embedding(image_filename)

To vectorize a remote image, you would put the URL of the image in the request body.
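
As a minimal sketch, assuming the API accepts a JSON body with a url field and reusing endpoint, key, and requests from the code above (the function name is illustrative and the URL is a placeholder):

import json

def get_remote_image_embedding(image_url):
    # Same endpoint and API version as above; the image is referenced by URL in a JSON body
    version = "?api-version=2023-02-01-preview&modelVersion=latest"
    vectorize_img_url = endpoint + "retrieval:vectorizeImage" + version

    headers = {
        "Content-type": "application/json",
        "Ocp-Apim-Subscription-Key": key
    }

    r = requests.post(vectorize_img_url, data=json.dumps({"url": image_url}), headers=headers)
    r.raise_for_status()
    return r.json()["vector"]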

Use the Vectorize Text API

As in the previous example, the get_text_embedding function sends a POST request to the retrieval:vectorizeText API, this time with the text prompt in a JSON request body.

import json  # requests, endpoint, and key are reused from the previous code example

def get_text_embedding(prompt):
    text = {'text': prompt}

    # Vectorize Text API
    version = "?api-version=2023-02-01-preview&modelVersion=latest"
    vectorize_txt_url = endpoint + "retrieval:vectorizeText" + version

    headers = {
        'Content-type': 'application/json',
        'Ocp-Apim-Subscription-Key': key
    }

    try:
        r = requests.post(vectorize_txt_url, data=json.dumps(text), headers=headers)

        if r.status_code == 200:
            text_vector = r.json()['vector']
            return text_vector
        else:
            print(f"An error occurred while processing the prompt '{text}'. Error code: {r.status_code}.")
        
    except Exception as e:
        print(f"An error occurred while processing the prompt '{text}': {e}")
    
    return None

text_prompt = "a blue house"
text_vector = get_text_embedding(text_prompt)

Calculate image similarity

The following code calculates the cosine similarity between the vector of the image and the vector of the text prompt.

from numpy import dot
from numpy.linalg import norm

def get_cosine_similarity(vector1, vector2):
    return dot(vector1, vector2) / (norm(vector1) * norm(vector2))

similarity = get_cosine_similarity(image_vector, text_vector)
print(similarity)
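
As a usage example, you could reuse the functions defined above to rank a handful of local images against a text prompt and print the most similar ones first; the filenames below are placeholders:

# Rank a few local images against a text prompt; the filenames are placeholders
image_filenames = ["images/image (1).jpg", "images/image (2).jpg", "images/image (3).jpg"]

text_vector = get_text_embedding("a blue house")
results = []
for filename in image_filenames:
    vector = get_image_embedding(filename)
    if vector is not None:
        results.append((filename, get_cosine_similarity(vector, text_vector)))

# Print the most similar images first
for filename, score in sorted(results, key=lambda r: r[1], reverse=True):
    print(f"{filename}: {score:.4f}")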

Next steps

In this article, you’ve learned the basics of vector search and explored the multi-modal embeddings API of Azure AI Vision. In the next post, you will create vector embeddings for a collection of images of paintings.
