
Try Multimodal Search with ColQwen2!

Published at
1/4/2025
Categories
multimodal
llm
rag
python
Author
M Sea Bass

In this article, we introduce how to use ColQwen2.

ColQwen2 is based on Qwen2-VL-2B and generates ColBERT-style multi-vector representations, enabling highly accurate searches across text and image inputs.

We will test ColQwen2 using Google Colab with an A100 GPU.

Library Installation

First, install the necessary libraries:

!pip install git+https://github.com/illuin-tech/colpali
!pip install pymupdf
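
If the installation succeeded, the imports used throughout this article should resolve without errors. A quick sanity check:

# Verify that both packages were installed correctly
import pymupdf
from colpali_engine.models import ColQwen2, ColQwen2Processor
print("imports OK")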

Preparing Image Data

Next, prepare the image data. For this tutorial, we’ll use the ColPali paper.

Using pymupdf, we’ll convert each page of the PDF into an image:

import pymupdf
import os

# Constants
DPI = 350  # Can be modified as needed

def convert_pdf_to_images(pdf_path, output_dir):
    """
    Convert PDF pages to images.
    Args:
        pdf_path (str): Path to the PDF file.
        output_dir (str): Directory to save images.
    """
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    pdf_document = pymupdf.open(pdf_path)

    for page_number in range(pdf_document.page_count):
        page = pdf_document[page_number]
        pix = page.get_pixmap(dpi=DPI)
        output_file = os.path.join(output_dir, f'page_{page_number + 1:02}.png')
        pix.save(output_file)

    pdf_document.close()

pdf_path = "/content/2407.01449v3.pdf"
output_dir = "output_images"
convert_pdf_to_images(pdf_path, output_dir)

Images will be saved in the "output_images" folder.
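
To confirm the conversion, you can list the generated files. Because the file names are zero-padded, lexicographic order matches page order:

import glob, os

# List the rendered page images in page order
for path in sorted(glob.glob(os.path.join("output_images", "*.png"))):
    print(path)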

Searching the Images

Now, let’s use ColQwen2. Refer to the Hugging Face model page for sample code.

After downloading the paper PDF and uploading it to Google Colab, execute the following code:

import glob, os

import torch
from PIL import Image

from colpali_engine.models import ColQwen2, ColQwen2Processor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

print(f"cuda available: {torch.cuda.is_available()}")

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.bfloat16,
    device_map=device,  # or "mps" if on Apple Silicon
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Your inputs (sorted so that embedding index i corresponds to page i + 1)
images = [Image.open(filepath) for filepath in sorted(glob.glob(os.path.join(output_dir, "*.png")))]
queries = [
    "What is the architecture of ColPali?",
    "How does it differ from previous studies?",
]

# Process the queries (images are processed batch by batch below)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
batch_size = 1  # process one image at a time to limit GPU memory usage
image_embeddings = []
for i in range(0, len(images), batch_size):
    batch = images[i : i + batch_size]
    resized_batch = [img.resize((512, 512)) for img in batch]  # Resize before processing
    batch_images = processor.process_images(resized_batch).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch_images)
    image_embeddings.extend(embeddings)
with torch.no_grad():
    query_embeddings = model(**batch_queries)

image_embeddings = torch.stack(image_embeddings)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)

print(scores)

The scores are returned as a tensor with one row per query and one column per page image:

tensor([[13.2500,  8.4375, 11.3750, 11.1875, 13.8125, 12.0000,  8.3125,  9.0000,
         10.4375,  8.7500, 10.4375, 11.6250,  7.8438,  7.4375,  9.9375,  8.0625,
          7.5000, 10.9375,  9.7500,  7.8750],
        [ 8.3750,  7.5000,  9.6250,  8.3125,  7.5625,  8.1250,  7.9688,  8.4375,
          8.5000,  9.0625,  7.7812,  8.3125,  7.5000,  7.9062,  8.6875,  7.9688,
          7.9062,  7.9688,  8.7500,  7.5000]])
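
For reference, score_multi_vector computes ColBERT-style late-interaction scores: each query token is matched to its most similar page token, and those maxima are summed. A minimal sketch of that computation for a single query-page pair (late_interaction_score is my own illustrative helper, not part of colpali_engine):

import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T           # all pairwise token similarities
    return sim.max(dim=1).values.sum()    # best page token per query token, summed

Applied to every (query, page) pair, this yields the 2 x 20 score matrix shown above.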

Visualizing Scores

Let’s visualize the scores:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


scores_df = pd.DataFrame(scores.cpu().numpy(), columns=[f'Image {i+1}' for i in range(scores.shape[1])]).T
scores_df.index.name = 'Images'

# Create two separate bar plots side-by-side
plt.figure(figsize=(12, 6))

# First bar plot
plt.subplot(1, 2, 1)
sns.barplot(x=scores_df.index, y=scores_df[0], color="skyblue")
plt.title("Query: " + queries[0])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

# Second bar plot
plt.subplot(1, 2, 2)
sns.barplot(x=scores_df.index, y=scores_df[1], color="lightcoral")
plt.title("Query: " + queries[1])
plt.xticks(rotation=45, ha="right")
plt.ylabel('Score')

plt.tight_layout()
plt.show()

[Figure: bar charts of each page’s score for the two queries]

Inspecting the Top Results

Let’s check the top 2 results for each query:

from IPython.display import display

# Indices of the top-2 scoring pages for each query
top_indices = scores.topk(2, dim=1).indices

for query, page_indices in zip(queries, top_indices.tolist()):
    print(f"{query}: {page_indices}")
    for idx in page_indices:
        image_path = os.path.join(output_dir, f"page_{idx + 1:02}.png")  # zero-padded, matching the saved file names
        display(Image.open(image_path))

Query 1: What is the architecture of ColPali?

Top 2 results:

1st: Page 5

2nd: Page 1

Page 5, which received the highest score, contains the word “Architecture.” However, the architecture diagram on page 2 received a lower score.

Query 2: How does it differ from previous studies?

Top 2 results:

1st: Page 3

2nd: Page 10

Page 3 contains the “Related Work” content, but the beginning of the related-work section on page 2 scored lower. Page 10, which contains the references, scored high, which is understandable for a query about previous studies.

Conclusion

We tested image search using ColQwen2. Searching entire PDF pages proved challenging; for practical use, extracting figures as standalone images might improve results.
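
As a starting point for that, pymupdf can also extract images embedded in a PDF instead of rasterizing whole pages. A rough sketch (extract_embedded_images and its file-naming scheme are my own, not from the code above):

import os
import pymupdf

def extract_embedded_images(pdf_path, output_dir):
    """Save each image embedded in the PDF (e.g. figures) as its own file."""
    os.makedirs(output_dir, exist_ok=True)
    doc = pymupdf.open(pdf_path)
    for page_number in range(doc.page_count):
        for img_index, img in enumerate(doc[page_number].get_images(full=True)):
            xref = img[0]  # cross-reference id of the image object
            info = doc.extract_image(xref)
            output_file = os.path.join(
                output_dir,
                f"page_{page_number + 1:02}_img_{img_index:02}.{info['ext']}",
            )
            with open(output_file, "wb") as f:
                f.write(info["image"])
    doc.close()

The extracted figures could then be embedded and searched with the same ColQwen2 pipeline as above.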

To extract text, images, and tables more effectively from PDFs, consider tools like pymupdf4llm.
