Digitize your notes with the Azure Computer Vision READ API

This article refers to a previous version of the Azure Computer Vision service. Check out my post Extract text from images with Azure Computer Vision 4.0 Read OCR for the latest updates.

The Computer Vision service provides pre-built, advanced algorithms that process and analyze images and extract text from photos and documents (Optical Character Recognition, OCR). In the previous article, we explored the built-in image analysis capabilities of Azure Computer Vision.

In this article, we will explore the pre-trained models of the Azure Computer Vision service for optical character recognition. We will build a simple Python notebook that turns your handwritten documents into digital notes. You will learn how to:

  • Provision a Computer Vision resource.
  • Use a Computer Vision resource to extract text from photos.

To complete the exercise, you will need to install:

  • Python 3,
  • Visual Studio Code,
  • Jupyter Notebook and Jupyter Extension for Visual Studio Code.

What is Optical Character Recognition?

The Azure Computer Vision OCR service can extract printed and handwritten text from photos and documents. The READ API uses the latest optical character recognition models and works asynchronously, which means that reading text is a three-step process:

  • Submit an image to the Computer Vision service.
  • Wait for the analysis operation to complete.
  • Retrieve the results of the analysis.
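
To make these steps concrete, here is a minimal sketch of the same flow against the REST endpoint directly, using the requests library. The v3.2 path and the placeholder key, endpoint, and image file are assumptions; the Python SDK used later in this article wraps these calls for you.

import time
import requests

key = 'YOUR_KEY'
endpoint = 'YOUR_ENDPOINT'  # e.g. https://<your-name>.cognitiveservices.azure.com
headers = {'Ocp-Apim-Subscription-Key': key,
           'Content-Type': 'application/octet-stream'}

# Step 1: Submit the image; the response contains no text yet.
with open('images/notes1.jpg', 'rb') as image:
    response = requests.post(endpoint + '/vision/v3.2/read/analyze',
                             headers=headers, data=image)
response.raise_for_status()

# Step 2: Poll the Operation-Location URL until the analysis completes.
operation_url = response.headers['Operation-Location']
while True:
    result = requests.get(operation_url, headers=headers).json()
    if result['status'] not in ('notStarted', 'running'):
        break
    time.sleep(1)

# Step 3: Retrieve the extracted lines from the JSON payload.
if result['status'] == 'succeeded':
    for page in result['analyzeResult']['readResults']:
        for line in page['lines']:
            print(line['text'])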

Study the following sketch note to learn more about Optical Character Recognition with the Azure Computer Vision READ API.

Azure Computer Vision: Overview of the READ API (sketch note)

You can find more information and how-to guides about Computer Vision and Optical Character Recognition on Microsoft Learn and Microsoft Docs.

Create a Computer Vision Resource

To use the Computer Vision service, you can create either a Computer Vision resource or a Cognitive Services resource. If you plan to use Computer Vision alongside other cognitive services, such as Text Analytics, create a Cognitive Services resource; otherwise, create a Computer Vision resource.

In this exercise, you will create a Computer Vision resource.

  1. Sign in to Azure Portal and select Create a resource.

    Create a resource in Azure portal
  2. Search for Computer Vision and then click Create.

    Search for Computer Vision
  3. Create a Computer Vision resource with the following settings:

    • Subscription: Your Azure subscription.
    • Resource group: Select an existing resource group or create a new one.
    • Region: Choose any available region, for example North Europe.
    • Name: Enter a unique name; it becomes the custom domain name of your endpoint.
    • Pricing tier: You can use the free pricing tier (F0) to try the service, and upgrade later to a paid tier.
    Create a Computer Vision resource
  4. Select Review + Create and wait for deployment to complete.

  5. Once the deployment is complete, select Go to resource. On the Overview tab, click Manage keys. Save Key 1 and the Endpoint; you will need them to connect to your Computer Vision resource from client applications.

    Manage Keys and Endpoint

Install the Computer Vision library

Install the Azure Cognitive Services Computer Vision SDK for Python package with pip:

pip install azure-cognitiveservices-vision-computervision

Create a new Python Notebook

Create a new Jupyter Notebook, for example image-analysis-demo.ipynb, and open it in Visual Studio Code or in your preferred editor.

Want to view the whole notebook at once? You can find it on GitHub.

  1. Import the following libraries.

    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from msrest.authentication import CognitiveServicesCredentials
    from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
    from PIL import Image
    import matplotlib.pyplot as plt
    import matplotlib.patches as patches
    import time
    import numpy as np
    
  2. Then, create variables for your Computer Vision resource. Replace YOUR_KEY with Key 1 and YOUR_ENDPOINT with your Endpoint. (A sketch that reads these values from environment variables follows this list.)

    key = 'YOUR_KEY'
    endpoint = 'YOUR_ENDPOINT'
    
  3. Authenticate the client. Create a ComputerVisionClient object with your key and endpoint.

    computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(key))
    
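A note on step 2: hard-coding the key is fine for a quick demo, but if you share the notebook you may prefer to load the values from the environment instead. A minimal sketch; the variable names COMPUTER_VISION_KEY and COMPUTER_VISION_ENDPOINT are my own choice:

import os

# Read the credentials from environment variables instead of the notebook source.
key = os.environ['COMPUTER_VISION_KEY']
endpoint = os.environ['COMPUTER_VISION_ENDPOINT']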

Extract handwritten text from photos

First, download the images used in the following examples from my GitHub repository.

In the next cell of the notebook, add the following code, which submits the notes1.jpg image to the Computer Vision READ API, then retrieves and prints the extracted text.

# Open local image file
image_path = "images/notes1.jpg"
image = open(image_path, "rb")
# Call the API
read_response = computervision_client.read_in_stream(image, raw=True)
# Get the operation location (URL with an ID at the end)
read_operation_location = read_response.headers["Operation-Location"]
# Grab the ID from the URL
operation_id = read_operation_location.split("/")[-1]
# Retrieve the results
while True:
    read_result = computervision_client.get_read_result(operation_id)
    if read_result.status not in ['notStarted', 'running']:
        break
    time.sleep(1)
# Print the detected text and bounding boxes
if read_result.status == OperationStatusCodes.succeeded:
    for text_result in read_result.analyze_result.read_results:
        for line in text_result.lines:
            print(line.text)
            print(line.bounding_box)

Display the bounding box of lines

The bounding_box property is the quadrangle bounding box of a line, specified as a list of 8 numbers (four (x, y) pairs). The coordinates are specified relative to the top-left corner of the original image.
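
Since those eight numbers are four (x, y) pairs in order, you can also reshape them into a 4x2 array with NumPy rather than indexing each pair by hand. A small convenience, not required by the API; it assumes line is one of the detected lines from the previous cell:

import numpy as np

# line.bounding_box is [x1, y1, x2, y2, x3, y3, x4, y4];
# reshape it into four (x, y) corner points.
box_coordinates = np.array(line.bounding_box).reshape(4, 2)

The resulting array can be passed directly to matplotlib's Polygon patch, as the next cell does with manually built pairs.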

Add this code to the next cell to display the bounding box of every detected line.

# Open local image file
image_path = "images/notes1.jpg"
image = open(image_path, "rb")
img = Image.open(image_path)
# Call the API
read_response = computervision_client.read_in_stream(image, raw=True)
# Get the operation location (URL with an ID at the end)
read_operation_location = read_response.headers["Operation-Location"]
# Grab the ID from the URL
operation_id = read_operation_location.split("/")[-1]
# Retrieve the results
while True:
    read_result = computervision_client.get_read_result(operation_id)
    if read_result.status not in ['notStarted', 'running']:
        break
    time.sleep(1)
# Create figure and axes
fig, ax = plt.subplots()
# Display the image
ax.imshow(img)
# Print the detected text and bounding boxes
if read_result.status == OperationStatusCodes.succeeded:
    for text_result in read_result.analyze_result.read_results:
        for line in text_result.lines:
            # Print line
            print(line.text)
            # line.bounding_box contains 4 pairs of (x, y) coordinates
            xy1 = [line.bounding_box[0], line.bounding_box[1]]
            xy2 = [line.bounding_box[2], line.bounding_box[3]]
            xy3 = [line.bounding_box[4], line.bounding_box[5]]
            xy4 = [line.bounding_box[6], line.bounding_box[7]]
            box_coordinates = np.array([xy1, xy2, xy3, xy4])

            # Create a polygon patch for the line's bounding box
            rect = patches.Polygon(box_coordinates, closed=True, linewidth=2, edgecolor='r', facecolor='none')
            # Add the patch to the Axes
            ax.add_patch(rect)
plt.show()
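
The submit-and-poll pattern is repeated in each cell, so you may prefer to factor it into a small helper function. This is my own refactoring of the code above, not part of the original notebook:

def read_image(client, image_path):
    # Submit a local image to the READ API and poll until the
    # asynchronous analysis completes, then return the result.
    with open(image_path, 'rb') as image:
        response = client.read_in_stream(image, raw=True)
    # The operation ID is the last segment of the Operation-Location URL.
    operation_id = response.headers['Operation-Location'].split('/')[-1]
    while True:
        result = client.get_read_result(operation_id)
        if result.status not in ['notStarted', 'running']:
            break
        time.sleep(1)
    return result

For example, read_result = read_image(computervision_client, 'images/notes1.jpg') replaces the submit, poll, and retrieve steps of each cell.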

Each word in a text line includes bounding box coordinates indicating its position in the image and a confidence value between 0 and 1 inclusive. The following code prints the detected words along with their confidence scores (as percentages) and displays the bounding boxes of both lines and words.

# Open local image file
image_path = "images/notes1.jpg"
image = open(image_path, "rb")
img = Image.open(image_path)
# Call the API
read_response = computervision_client.read_in_stream(image, raw=True)
# Get the operation location (URL with an ID at the end)
read_operation_location = read_response.headers["Operation-Location"]
# Grab the ID from the URL
operation_id = read_operation_location.split("/")[-1]
# Retrieve the results
while True:
    read_result = computervision_client.get_read_result(operation_id)
    if read_result.status not in ['notStarted', 'running']:
        break
    time.sleep(1)
# Create figure and axes
fig, ax = plt.subplots()
# Display the image
ax.imshow(img)
# Print the detected text and bounding boxes
if read_result.status == OperationStatusCodes.succeeded:
    for text_result in read_result.analyze_result.read_results:
        for line in text_result.lines:
            # Print line
            print(line.text)
            # line.bounding_box contains 4 pairs of (x, y) coordinates
            xy1 = [line.bounding_box[0], line.bounding_box[1]]
            xy2 = [line.bounding_box[2], line.bounding_box[3]]
            xy3 = [line.bounding_box[4], line.bounding_box[5]]
            xy4 = [line.bounding_box[6], line.bounding_box[7]]
            box_coordinates = np.array([xy1, xy2, xy3, xy4])

            # Create a polygon patch for the line's bounding box
            box = patches.Polygon(box_coordinates, closed=True, linewidth=2, edgecolor='r', facecolor='none')
            # Add the patch to the Axes
            ax.add_patch(box)
            # Print words in line with confidence score
            for word in line.words:
                print(f"   * {word.text}: {word.confidence * 100:.2f}")

                xy1 = [word.bounding_box[0], word.bounding_box[1]]
                xy2 = [word.bounding_box[2], word.bounding_box[3]]
                xy3 = [word.bounding_box[4], word.bounding_box[5]]
                xy4 = [word.bounding_box[6], word.bounding_box[7]]
                box_coordinates = np.array([xy1, xy2, xy3, xy4])

                # Create a polygon patch for the word's bounding box
                box = patches.Polygon(box_coordinates, closed=True, linewidth=1, edgecolor='c', facecolor='none')
                # Add the patch to the Axes
                ax.add_patch(box)
plt.show()
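
If you only want to keep words the model is confident about, you can filter on the score. A sketch using an arbitrary threshold of 0.8; read_result comes from the cell above:

# Collect words with a confidence score of at least 0.8
# (the threshold is an arbitrary choice for this example).
if read_result.status == OperationStatusCodes.succeeded:
    confident_words = [
        word.text
        for page in read_result.analyze_result.read_results
        for line in page.lines
        for word in line.words
        if word.confidence >= 0.8
    ]
    print(confident_words)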

Summary and next steps

In this article, you learned how to use the Azure Computer Vision READ API to extract text from photos (Optical Character Recognition, OCR). For more information about the Azure Cognitive Services Computer Vision SDK for Python, see the computervision package documentation.

Clean-up

If you have finished learning, you can delete the resource group from your Azure subscription:

  1. In the Azure Portal, select Resource groups from the portal menu and then select the resource group that you created.

  2. Click Delete resource group.
